  • Understanding the MapReduce computing framework

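    Task: write a WordCount program in Python

    Program

    WordCount

    Input

    A text file containing a large number of words

    Output

    Each word in the file and its number of occurrences (frequency), sorted alphabetically by word, with one word and its frequency per line and the word and the frequency separated by whitespace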
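
    As a reference for the expected output format, here is a minimal single-process sketch that does not use MapReduce at all. The function name `word_count_lines` is made up for this illustration, and a tab is assumed as the separator because the mapper and reducer in this post use one. Note that `sorted` orders by code point, so uppercase words come before lowercase ones.

```python
from collections import Counter


def word_count_lines(text):
    """Return 'word<TAB>count' lines, sorted by word (hypothetical helper)."""
    counts = Counter(text.split())  # split on any whitespace
    return ["%s\t%d" % (word, counts[word]) for word in sorted(counts)]


if __name__ == "__main__":
    for line in word_count_lines("foo foo quux labs foo bar quux"):
        print(line)
```

    The MapReduce version should produce the same lines for the same input; it only distributes the counting over several tasks.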

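    1. Write the map function and the reduce function
    2. Change their file permissions so that they are executable
    3. Test the code on the local machine
    4. Run it on HDFS
    5. Download the input files and upload them to HDFS
    6. Submit the job with the Hadoop Streaming command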
      create 'Student', 'S_No','S_Name', 'S_Sex','S_Age'
      
      put 'Student','s001','S_No','2015001'
      put 'Student','s001','S_Name','Zhangsan'
      put 'Student','s001','S_Sex','male'
      put 'Student','s001','S_Age','23'
      
      put 'Student','s002','S_No','2015003'
      put 'Student','s002','S_Name','Mary'
      put 'Student','s002','S_Sex','female'
      put 'Student','s002','S_Age','22'
      
      put 'Student','s003','S_No','2015003'
      put 'Student','s003','S_Name','Lisi'
      put 'Student','s003','S_Sex','male'
      put 'Student','s003','S_Age','24'
      

        

      scan 'Student'
      alter 'Student','NAME'=>'course'
      put 'Student','3','course:Math','85'
      alter 'Student', 'delete' => 'course'
      count 's1'
      truncate 's1'
      

        

      cd /home/hadoop/wc
      sudo gedit mapper.py
      
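      #!/usr/bin/env python
      # map function (mapper.py): emit "<word>\t1" for every word read from stdin
      import sys

      for line in sys.stdin:
          line = line.strip()
          words = line.split()  # split on any whitespace
          for word in words:
              print('%s\t%s' % (word, 1))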
      
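      #!/usr/bin/env python
      # reduce function (reducer.py): sum the counts of each word; input must be sorted by word
      import sys

      current_word = None
      current_count = 0

      for line in sys.stdin:
          line = line.strip()
          if '\t' not in line:
              continue  # skip blank or malformed lines
          word, count = line.split('\t', 1)
          try:
              count = int(count)
          except ValueError:
              continue  # ignore lines whose count is not a number

          if current_word == word:
              current_count += count
          else:
              # the key changed: print the finished word before starting the next one
              if current_word is not None:
                  print('%s\t%s' % (current_word, current_count))
              current_count = count
              current_word = word

      # print the last word
      if current_word is not None:
          print('%s\t%s' % (current_word, current_count))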
      

        

      chmod a+x /home/hadoop/wc/mapper.py /home/hadoop/wc/reducer.py
      

        

      echo "foo foo quux labs foo bar quux" | /home/hadoop/wc/mapper.py
      
      echo "foo foo quux labs foo bar quux" | /home/hadoop/wc/mapper.py | sort -k1,1 | /home/hadoop/wc/reducer.py
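
    The `sort -k1,1` step in the pipeline above stands in for Hadoop's shuffle phase: the reducer only compares each word with the previous one, so equal words must arrive next to each other. The sketch below simulates the three stages in one Python process. It is an illustration rather than the scripts from this post, and the function names `map_words`, `reduce_sorted` and `run_pipeline` are invented for it.

```python
from itertools import groupby


def map_words(lines):
    """Mapper stage: yield (word, 1) for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_sorted(pairs):
    """Reducer stage: sum the counts of adjacent pairs that share a word."""
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def run_pipeline(lines):
    """Simulate `mapper | sort -k1,1 | reducer` in a single process."""
    shuffled = sorted(map_words(lines), key=lambda pair: pair[0])
    return list(reduce_sorted(shuffled))


if __name__ == "__main__":
    sample = ["foo foo quux labs foo bar quux"]
    print(run_pipeline(sample))
    # Without the sort, equal words are not adjacent and are counted in pieces.
    print(list(reduce_sorted(map_words(sample))))
```

    With the sort, `foo` is reported once with a count of 3. Without it, `foo` appears twice (counts 2 and 1), which is the same failure the real reducer shows when it is fed unsorted input.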
      

        

      cd  /home/hadoop/wc
      wget http://www.gutenberg.org/files/5000/5000-8.txt
      wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt
      
      
      cd /usr/hadoop/wc
      hdfs dfs -put /home/hadoop/wc/*.txt /user/hadoop/input
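
    The post stops before step 6, submitting the job with Hadoop Streaming. The sketch below only builds the command line and prints it, so it can be run without a cluster. The location of the streaming jar depends on the installation, so `/path/to/hadoop-streaming.jar` is a placeholder, and the output directory `/user/hadoop/output` is an assumption (Hadoop requires that it does not exist yet). The helper name `build_streaming_command` is made up for this example.

```python
import os


def build_streaming_command(streaming_jar, input_dir, output_dir,
                            mapper="/home/hadoop/wc/mapper.py",
                            reducer="/home/hadoop/wc/reducer.py"):
    """Build the argument list for a Hadoop Streaming job (hypothetical helper)."""
    return [
        "hadoop", "jar", streaming_jar,
        # -files is a generic option, so it goes before the streaming options;
        # it ships both scripts to the working directory of every task.
        "-files", "%s,%s" % (mapper, reducer),
        "-mapper", os.path.basename(mapper),
        "-reducer", os.path.basename(reducer),
        "-input", input_dir,
        "-output", output_dir,
    ]


if __name__ == "__main__":
    command = build_streaming_command(
        "/path/to/hadoop-streaming.jar",  # placeholder, adjust to the local install
        "/user/hadoop/input",
        "/user/hadoop/output",            # assumed name, must not exist yet
    )
    print(" ".join(command))
```

    On a machine with Hadoop installed, the printed line can be pasted into a shell, or the list can be passed to `subprocess.run(command, check=True)`.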
      

        

  • Original article: https://www.cnblogs.com/Runka/p/9026668.html