  • Word frequency counting with Hive

    Contents of the input file:

    $ /opt/cdh-5.3.6/hadoop-2.5.0/bin/hdfs dfs -text /user/hadoop/wordcount/input/wc.input
    hadoop spark
    spark hadoop
    oracle mysql postgresql
    postgresql oracle mysql
    mysql mongodb
    hdfs yarn mapreduce
    yarn hdfs
    zookeeper

    Word frequency counting on the above file with Hive:

    create table docs (line string);

    load data inpath '/user/hadoop/wordcount/input/wc.input' into table docs;
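
    Note: LOAD DATA INPATH without the LOCAL keyword moves the file within HDFS into the table's warehouse directory, so the original path will no longer hold it. A hedged alternative, assuming a hypothetical copy of the file exists on the local filesystem at /tmp/wc.input, copies the data instead:

    -- LOCAL reads from the local filesystem and copies rather than moves the file (hypothetical path)
    load data local inpath '/tmp/wc.input' into table docs;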

    create table word_counts as
    select word,count(1) as count from
    (select explode(split(line,' ')) as word from docs) word
    group by word
    order by word;

    Step-by-step explanation:

    -- Use the split function to break each line of the table on spaces (the first line of the file ends with a trailing space, which produces an empty element):

    select split(line,' ') from docs;
    ["hadoop","spark",""]
    ["spark","hadoop"]
    ["oracle","mysql","postgresql"]
    ["postgresql","oracle","mysql"]
    ["mysql","mongodb"]
    ["hdfs","yarn","mapreduce"]
    ["yarn","hdfs"]
    ["zookeeper"]

    -- Use the explode function to turn each array produced by split into one row per element:

    select explode(split(line,' ')) as word from docs;
    word
    hadoop
    spark

    spark
    hadoop
    oracle
    mysql
    postgresql
    postgresql
    oracle
    mysql
    mysql
    mongodb
    hdfs
    yarn
    mapreduce
    yarn
    hdfs
    zookeeper
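
    Hive only allows explode as the sole expression in the SELECT list; when other columns from docs are needed alongside the words, the query must use LATERAL VIEW. A hedged sketch of the same count written in that form:

    -- lateral view joins each row of docs with the rows produced by explode
    select t.word, count(1) as count
    from docs lateral view explode(split(line,' ')) t as word
    group by t.word
    order by t.word;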

    -- The output above is already in a shape suitable for aggregation, so the word counts can be computed with a plain SQL query:

    select word,count(1) as count from
    (select explode(split(line,' ')) as word from docs) word
    group by word
    order by word;

    word    count
         1
    hadoop    2
    hdfs    2
    mapreduce    1
    mongodb    1
    mysql    3
    oracle    2
    postgresql    2
    spark    2
    yarn    2
    zookeeper    1
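
    The blank word with a count of 1 comes from the empty element produced by the trailing space on the first line of the file. A hedged variant of the query that filters empty tokens out before aggregating:

    -- drop empty tokens so the blank row disappears from the result
    select word, count(1) as count from
    (select explode(split(line,' ')) as word from docs) w
    where word != ''
    group by word
    order by word;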

  • Original article: https://www.cnblogs.com/wcwen1990/p/7116041.html