
Who Is the Protagonist of Demi-Gods and Semi-Devils (天龙八部)? (MapReduce Word Frequency Count)

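Demi-Gods and Semi-Devils (天龙八部) mainly follows three characters: Duan Yu (段誉), Xiao Feng (萧峰), and Xu Zhu (虚竹). So which of them is the real protagonist? For this exercise, let's simply assume that whoever is mentioned most often in the novel is the protagonist.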

The experiment runs in a Linux environment.

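First, download the text of the novel:
wget http://labfile.oss.aliyuncs.com/hadoop/tlbbtestfile.txt
Install the jieba Chinese word segmentation library:
sudo pip install jieba
Upload the text to HDFS as /tlbb.txt:
hdfs dfs -put tlbbtestfile.txt /tlbb.txt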

# Create a directory for the code
mkdir tlbbwordcount
# Create the Mapper script
touch tlbbwordcount/mapper.py
# Create the Reducer script
touch tlbbwordcount/reducer.py
# Make all Python scripts executable
chmod a+x tlbbwordcount/*.py

Mapper program:

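#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function

import sys

# Import the jieba word segmentation module
import jieba

# Read stdin line by line
for line in sys.stdin:

    # Segment each line into words with jieba
    wlist = jieba.cut(line.strip())

    # Map step: turn every word into a (word, 1) pair
    for word in wlist:
        try:
            # In Python 2 jieba yields unicode, which must be encoded before printing
            if not isinstance(word, str):
                word = word.encode("utf8")
            # Write the pair to stdout as "word<TAB>1"
            print("%s\t1" % word)
        except UnicodeError:
            # Skip words that cannot be encoded
            pass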
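
The mapper's contract is small: one `word<TAB>1` line per token. The following is a minimal sketch of that contract as a function that can be checked without Hadoop or jieba. The names `tokenize` and `map_line` are hypothetical and introduced here only for illustration, and whitespace splitting is an assumed stand-in for `jieba.cut`.

```python
def tokenize(line):
    # Stand-in tokenizer (assumption): the real mapper calls jieba.cut here.
    return line.split()


def map_line(line, tokenizer=tokenize):
    # Yield one "word<TAB>1" record per token, in the order the tokens appear.
    for word in tokenizer(line.strip()):
        yield "%s\t1" % word


# A pre-segmented sample sentence, so that whitespace splitting is enough.
records = list(map_line("段誉 见到 萧峰 ， 段誉 大喜"))

if __name__ == "__main__":
    for record in records:
        print(record)
```

Note that the mapper does no counting at all: a word that appears twice in a line simply produces two records, and adding them up is left to the reducer.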

Reducer program:

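#!/usr/bin/env python
from __future__ import print_function

import sys

# Running state: the word currently being accumulated and its count so far
current_word, current_count, word = None, 0, None

# Read stdin line by line
for line in sys.stdin:
    try:
        # Each line is a "word<TAB>count" pair; extract the word and its count
        line = line.rstrip()
        word, count = line.split("\t", 1)
        count = int(count)
    except ValueError:
        # Skip malformed lines
        continue

    # Is this line's word the same as the one being accumulated?
    if current_word == word:
        # Yes: add to its running count
        current_count += count
    else:
        # No: emit the finished word and its total, then start on the new word
        if current_word:
            print("%s\t%d" % (current_word, current_count))
        current_count, current_word = count, word

# After the input ends, emit the last word, which has not been printed yet
if current_word:
    print("%s\t%d" % (current_word, current_count))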
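
The reducer only ever compares a word with the one before it. That is enough because Hadoop Streaming hands each reducer its input sorted by key, so identical words arrive on adjacent lines. The sketch below reproduces the map, sort, reduce flow in a single process so the logic can be checked without a cluster. The function names are hypothetical, and whitespace splitting is again an assumed stand-in for `jieba.cut`.

```python
from itertools import groupby


def mapper(lines):
    # Emit (word, 1) pairs; whitespace splitting stands in for jieba.cut (assumption).
    for line in lines:
        for word in line.split():
            yield word, 1


def reducer(pairs):
    # Sum the counts of each run of adjacent, equal words in a single pass.
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def run_wordcount(lines):
    # sorted() plays the role of Hadoop's shuffle-and-sort phase.
    return dict(reducer(sorted(mapper(lines))))


counts = run_wordcount(["段誉 萧峰 虚竹", "萧峰 段誉", "萧峰"])

if __name__ == "__main__":
    for word, total in counts.items():
        print("%s\t%d" % (word, total))
```

If the sort is skipped, the same word can show up in several separate runs and is emitted once per run with a partial count. This is why testing the scripts locally needs a `sort` step between them.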

Run the job:

hadoop jar /opt/hadoop-2.6.1/share/hadoop/tools/lib/hadoop-streaming-2.6.1.jar \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /tlbb.txt \
    -output tlbbout \
    -jobconf mapred.map.tasks=4 \
    -jobconf mapred.reduce.tasks=2
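
With two reduce tasks requested, the output directory should contain one part file per reducer, and every occurrence of a given word is routed to the same reducer, so each word's total appears in exactly one part file. The sketch below illustrates that routing idea only. It uses `zlib.crc32` as a stand-in hash, which is an assumption made for illustration: Hadoop's default partitioner applies its own hash to the key, so the real assignment of words to part files will differ.

```python
import zlib

NUM_REDUCERS = 2  # mirrors mapred.reduce.tasks=2 in the command above


def partition(word, num_reducers=NUM_REDUCERS):
    # Stand-in hash (assumption): Hadoop uses its own key hash, not CRC32.
    return zlib.crc32(word.encode("utf-8")) % num_reducers


# Mapper output before the shuffle: the same word may occur many times.
pairs = [("段誉", 1), ("萧峰", 1), ("段誉", 1), ("虚竹", 1), ("萧峰", 1)]

# Route each pair to a reducer by hashing its key.
buckets = {reducer_id: [] for reducer_id in range(NUM_REDUCERS)}
for word, count in pairs:
    buckets[partition(word)].append((word, count))

if __name__ == "__main__":
    for reducer_id, assigned in buckets.items():
        print(reducer_id, sorted(set(word for word, _ in assigned)))
```

Because the routing depends only on the word itself, no word is split across reducers, and the full result is simply the concatenation of all part files.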

Result: (not included here)
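
Since the job output is just `word<TAB>count` lines, answering the opening question comes down to looking up three names. This is a minimal sketch, assuming the part files have been concatenated into one text (for instance with `hdfs dfs -cat tlbbout/part-*`) and that jieba emitted each name as a single token. The helper names are hypothetical, and the counts in `SAMPLE_OUTPUT` are made-up placeholders, not results from the novel.

```python
def parse_counts(text):
    # Parse "word<TAB>count" lines into a dict, skipping malformed lines.
    counts = {}
    for line in text.splitlines():
        try:
            word, count = line.rstrip().split("\t", 1)
            counts[word] = counts.get(word, 0) + int(count)
        except ValueError:
            continue
    return counts


def rank_names(counts, names):
    # Return (name, count) pairs sorted from most to least frequent.
    found = [(name, counts.get(name, 0)) for name in names]
    return sorted(found, key=lambda item: item[1], reverse=True)


# Placeholder data (assumption): NOT the real counts from the novel.
SAMPLE_OUTPUT = "段誉\t3\n萧峰\t2\n虚竹\t1\n的\t9\n"

ranking = rank_names(parse_counts(SAMPLE_OUTPUT), ["段誉", "萧峰", "虚竹"])

if __name__ == "__main__":
    for name, total in ranking:
        print("%s\t%d" % (name, total))
```

Under the article's working assumption, the first entry of `ranking` computed from the real output would be the protagonist.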

Lab URL:

https://www.shiyanlou.com/courses/40/labs/305/document


Original article: https://www.cnblogs.com/mycd/p/7865462.html