  • [Spark][Python] Extracting the UserID from a web log as the key to form a new RDD

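    On an RDD, use keyBy to build (key, line) pairs. keyBy applies a function to each element and uses the result as the key, keeping the original element as the value: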


    [training@localhost ~]$ cat webs.log

    56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"
    56.32.230.186 - 90700 "GET/contents.css HTTP/1.0"
    202.156.27.99 - 25223 "GET /KDDOC-00220.html HTTP/1.0"
    [training@localhost ~]$
    [training@localhost ~]$ hdfs dfs -put webs.log
    [training@localhost ~]$
    [training@localhost ~]$ hdfs dfs -cat webs.log
    56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"
    56.32.230.186 - 90700 "GET/contents.css HTTP/1.0"
    202.156.27.99 - 25223 "GET /KDDOC-00220.html HTTP/1.0"
    [training@localhost ~]$
    [training@localhost ~]$

    In [23]: mylogs = sc.textFile("webs.log")

    In [25]: mylogs001 = mylogs.keyBy(lambda line: line.split(' ')[2])

    In [26]: mylogs001.take(1)
    Out[26]: [(u'90700', u'56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"')]

    In [28]: mylogs001.take(2)
    Out[28]:
    [(u'90700', u'56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"'),
    (u'90700', u'56.32.230.186 - 90700 "GET/contents.css HTTP/1.0"')]
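To make the shape of the result concrete, here is a minimal plain-Python sketch of what keyBy does to each record. It does not use Spark: `key_by` is a hypothetical helper written for illustration, and the list `log_lines` stands in for the RDD. The only assumption carried over from the session is that the user ID is the third space-separated field.

```python
# Plain-Python stand-in for RDD.keyBy(f): pair each element with f(element).
# key_by is a hypothetical helper for illustration, not part of PySpark.
def key_by(records, key_func):
    return [(key_func(record), record) for record in records]


# The three lines from webs.log, copied from the session above.
log_lines = [
    '56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"',
    '56.32.230.186 - 90700 "GET/contents.css HTTP/1.0"',
    '202.156.27.99 - 25223 "GET /KDDOC-00220.html HTTP/1.0"',
]

# Same key function as in the session: the third space-separated field.
pairs = key_by(log_lines, lambda line: line.split(' ')[2])

for user_id, line in pairs:
    print(user_id, "->", line)
```

In PySpark itself, `rdd.keyBy(f)` gives the same pairs as `rdd.map(lambda x: (f(x), x))`; the helper above mirrors that on an ordinary list.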


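    For comparison, look at mylogs001.take(3) alongside mylogs.take(3): the keyed RDD holds (UserID, line) tuples, while the original RDD holds only the lines.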

    In [30]: mylogs001.take(3)
    Out[30]:
    [(u'90700', u'56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"'),
    (u'90700', u'56.32.230.186 - 90700 "GET/contents.css HTTP/1.0"'),
    (u'25223', u'202.156.27.99 - 25223 "GET /KDDOC-00220.html HTTP/1.0"')]


    In [31]: mylogs.take(3)
    Out[31]:
    [u'56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"',
    u'56.32.230.186 - 90700 "GET/contents.css HTTP/1.0"',
    u'202.156.27.99 - 25223 "GET /KDDOC-00220.html HTTP/1.0"']
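The comparison shows that keying leaves each line unchanged and only prepends the user ID. The sketch below, again in plain Python rather than Spark, checks that property and then groups the lines by key, which is the kind of per-user operation that (key, value) pairs make possible. The names `pairs` and `lines_by_user` are illustrative; on a real pair RDD the comparable calls would be `groupByKey()` or `countByKey()`, which the session above does not run.

```python
from collections import defaultdict

# The three lines from webs.log, copied from the session above.
log_lines = [
    '56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"',
    '56.32.230.186 - 90700 "GET/contents.css HTTP/1.0"',
    '202.156.27.99 - 25223 "GET /KDDOC-00220.html HTTP/1.0"',
]

# Build (user_id, line) pairs the same way keyBy does in the session.
pairs = [(line.split(' ')[2], line) for line in log_lines]

# The values of the keyed data are exactly the original lines.
assert [line for _, line in pairs] == log_lines

# Group the lines by user ID (illustrative local equivalent of grouping by key).
lines_by_user = defaultdict(list)
for user_id, line in pairs:
    lines_by_user[user_id].append(line)

for user_id, lines in lines_by_user.items():
    print(user_id, len(lines))
```

With the three sample lines, user 90700 ends up with two requests and user 25223 with one.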

  • Original article: https://www.cnblogs.com/gaojian/p/008-Aggregating-Data-with-Pair-RDDs-keyBy.html