  • [Gandalf] The Parallel Frequent Pattern Mining Algorithm FP Growth and Its Command Usage in Mahout

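            Today I looked into the parallel frequent pattern mining algorithm PFP Growth and how to run it from the Mahout command line. Here is a brief record of the experiment for future reference: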

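    • Environment: JDK 1.7 + Hadoop 2.2.0 single-node pseudo-distributed cluster + Mahout 0.6 (versions 0.8 and 0.9 do not include this algorithm; it was a bit of a surprise that Mahout 0.6 gets along peacefully with Hadoop 2.2.0, orz)
    • Part of the input data; each line represents one shopping basket: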

    4750,19394,25651,6395,5592
    26180,10895,24571,23295,20578,27791,2729,8637
    7380,18805,25086,19048,3190,21995,10908,12576
    3458,12426,20578
    1880,10702,1731,5185,18575,28967
    21815,10872,18730
    20626,17921,28930,14580,2891,11080
    18075,6548,28759,17133
    7868,15200,13494
    7868,28617,18097,22999,16323,8637,7045,25733
    12189,8816,22950,18465,13258,27791,20979
    26728
    17512,14821,18741
    26619,14470,21899,6731
    5184
    28653,28662,18353,27437,5661,12078,11849,15784,7248,7061,18612,24277,4807,15584,9671,18741,3647,1000

    ...
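To make the role of the minimum support threshold concrete, here is a minimal Python sketch that reads baskets in the format above and counts in how many baskets each item appears. It is only an illustration of the first counting step of frequent pattern mining, not Mahout's implementation; the function names `count_item_support` and `frequent_items` are hypothetical, and the threshold of 2 is chosen only because the sample is tiny.

```python
from collections import Counter


def count_item_support(lines):
    """Count in how many baskets each item id appears."""
    support = Counter()
    for line in lines:
        # One basket per line; items are comma-separated ids.
        items = {tok.strip() for tok in line.split(",") if tok.strip()}
        support.update(items)  # a set, so each item counts once per basket
    return support


def frequent_items(support, min_support):
    """Keep items whose support reaches the threshold, most frequent first."""
    kept = [(item, n) for item, n in support.items() if n >= min_support]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# A few baskets copied from the sample input above.
sample = [
    "26180,10895,24571,23295,20578,27791,2729,8637",
    "3458,12426,20578",
    "7868,15200,13494",
    "7868,28617,18097,22999,16323,8637,7045,25733",
    "12189,8816,22950,18465,13258,27791,20979",
]

support = count_item_support(sample)
print(frequent_items(support, min_support=2))
```

With the full data set and a threshold of 3, the same idea decides which items can take part in any frequent pattern at all: an item that appears in fewer baskets than the threshold cannot belong to a frequent itemset.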

    • Command to run:

    mahout fpg -i /workspace/dataguru/hadoopdev/week13/fpg/in/ -o /workspace/dataguru/hadoopdev/week13/fpg/out -method mapreduce -s 3

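    Parameter notes:

    -i input path. Because the job runs in a Hadoop environment, the input path must be an HDFS path; the input used in this experiment is /workspace/dataguru/hadoopdev/week13/fpg/in/user2items.csv

    -o output path; specifies the output path in HDFS

    -method execution method; mapreduce runs the job on Hadoop

    -s minimum support; a pattern is kept only if it occurs in at least this many baskets (3 here)

    For a complete description of the parameters, see the table below: [table not preserved]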


    • Output directory after the command has run:

    casliyang@singlehadoop:~$ hadoop dfs -ls /workspace/dataguru/hadoopdev/week13/fpg/out
    DEPRECATED: Use of this script to execute hdfs command is deprecated.
    Instead use the hdfs command for it.
    Found 4 items
    -rw-r--r--   3 casliyang supergroup       5567 2014-06-17 17:50 /workspace/dataguru/hadoopdev/week13/fpg/out/fList
    drwxr-xr-x   - casliyang supergroup          0 2014-06-17 17:51 /workspace/dataguru/hadoopdev/week13/fpg/out/fpgrowth
    drwxr-xr-x   - casliyang supergroup          0 2014-06-17 17:51 /workspace/dataguru/hadoopdev/week13/fpg/out/frequentpatterns
    drwxr-xr-x   - casliyang supergroup          0 2014-06-17 17:50 /workspace/dataguru/hadoopdev/week13/fpg/out/parallelcounting


    The mined frequent patterns are in the frequentpatterns directory:

    casliyang@singlehadoop:~$ hadoop dfs -ls /workspace/dataguru/hadoopdev/week13/fpg/out/frequentpatterns
    DEPRECATED: Use of this script to execute hdfs command is deprecated.
    Instead use the hdfs command for it.
    Found 2 items
    -rw-r--r--   3 casliyang supergroup          0 2014-06-17 17:51 /workspace/dataguru/hadoopdev/week13/fpg/out/frequentpatterns/_SUCCESS
    -rw-r--r--   3 casliyang supergroup      10017 2014-06-17 17:51 /workspace/dataguru/hadoopdev/week13/fpg/out/frequentpatterns/part-r-00000


    This file is a serialized sequence file and cannot be viewed directly. Mahout provides a command that converts it to plain text:

    mahout seqdumper -s /workspace/dataguru/hadoopdev/week13/fpg/out/frequentpatterns/part-r-00000 -o /home/casliyang/outpattern

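    Note that the output file path given with -o must be on the local Linux file system, and the target file must be created beforehand; otherwise the command reports an error.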

    • Part of the final result written to /home/casliyang/outpattern:

    Key: 29099: Value: ([29099],18), ([29099, 4479],3)
    Key: 29202: Value: ([29202],3)
    Key: 29203: Value: ([29203],9), ([14020, 29203],3)
    Key: 29224: Value: ([29224],3)
    Key: 29547: Value: ([29547],5)
    Key: 2963: Value: ([2963],8), ([2963, 21146],3)
    Key: 2999: Value: ([2999],3)
    Key: 3032: Value: ([3032],4)
    Key: 3047: Value: ([3047],4)
    Key: 3151: Value: ([3151],7), ([14020, 3151],4)
    Key: 3181: Value: ([3181],3)
    Key: 3228: Value: ([3228],14)
    Key: 3313: Value: ([3313],3)
    Key: 3324: Value: ([3324],3)
    Key: 3438: Value: ([3438],3)
    Key: 3458: Value: ([3458],4)
    Key: 3627: Value: ([3627],11), ([3627, 11176],3)

    ...

    Meaning:

    Key: item id

    Value: the frequent patterns that involve this item, each with its support count
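As a rough illustration of how this text format can be read back into a program, the Python sketch below parses lines shaped exactly like the sample output above. It assumes the `Key: ...: Value: ([items],support), ...` layout shown here and silently skips any other lines the dump may contain; `parse_line` and `multi_item_patterns` are hypothetical helper names, not part of Mahout.

```python
import re

# Matches one "([item, item, ...],support)" group inside the Value part.
PATTERN_RE = re.compile(r"\(\[([^\]]*)\],\s*(\d+)\)")


def parse_line(line):
    """Parse 'Key: K: Value: ([..],n), ...' into (key, [(items, support), ...])."""
    line = line.strip()
    if not line.startswith("Key: ") or ": Value: " not in line:
        return None  # not a key/value line, so skip it
    key_part, value_part = line[len("Key: "):].split(": Value: ", 1)
    patterns = [
        (tuple(tok.strip() for tok in items.split(",") if tok.strip()), int(n))
        for items, n in PATTERN_RE.findall(value_part)
    ]
    return key_part.strip(), patterns


def multi_item_patterns(lines, min_size=2):
    """Collect patterns that contain at least min_size items."""
    found = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        _, patterns = parsed
        found.extend(p for p in patterns if len(p[0]) >= min_size)
    return found


# Lines copied from the sample output above.
sample = [
    "Key: 29099: Value: ([29099],18), ([29099, 4479],3)",
    "Key: 29202: Value: ([29202],3)",
    "Key: 3151: Value: ([3151],7), ([14020, 3151],4)",
]

for items, n in multi_item_patterns(sample):
    print(items, n)
```

Item ids are kept as strings so that they match the text of the dump; convert them to integers if your downstream code needs numeric ids.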


    With the mined frequent patterns in hand, you can process them further in your own program according to business needs.
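One possible kind of further processing is to turn the patterns into simple association rules. The Python sketch below uses the standard definition of confidence, support(pattern) / support(item), on values copied from the sample output. This is an illustrative assumption about what a business requirement might look like, not something Mahout computes here; `rule_confidences` is a hypothetical name, and the input dictionary is typed in by hand.

```python
def rule_confidences(key, patterns):
    """Return [(key, other_items, confidence)] for rules 'key -> other_items'."""
    # Support of the single-item pattern [key]; without it no rule can be scored.
    base = next((n for items, n in patterns if items == (key,)), None)
    if not base:
        return []
    rules = []
    for items, n in patterns:
        rest = tuple(i for i in items if i != key)
        if key in items and rest:
            rules.append((key, rest, n / base))
    return rules


# Key -> [(items, support)], copied by hand from the sample output above.
parsed = {
    "29099": [(("29099",), 18), (("29099", "4479"), 3)],
    "3151": [(("3151",), 7), (("14020", "3151"), 4)],
    "29202": [(("29202",), 3)],
}

for key, patterns in parsed.items():
    for antecedent, consequent, conf in rule_confidences(key, patterns):
        print(f"{antecedent} -> {', '.join(consequent)}: confidence {conf:.2f}")
```

For example, item 3151 appears in 7 baskets and together with 14020 in 4 of them, so the rule 3151 -> 14020 has confidence 4/7.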

    Mahout really is a great open-source project.


  • Original post: https://www.cnblogs.com/yjbjingcha/p/6887827.html