  • Using MapReduce to build the set of items each user has viewed, for collaborative filtering

    1. Background

      The examples bundled with Hadoop are in

      D:\HADOOP_HOME\hadoop-2.6.4\share\hadoop\mapreduce\sources\hadoop-mapreduce-examples 2.6.0-source.jar

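      I remember being asked about medians in an interview, specifically the median of a data stream. One question like that is enough to tell whether someone has actually worked with Hadoop.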

    2. Implementation

    2.1 Mapper

    package cf;
    
    import java.io.IOException;
    
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    
    // Parses each "userID,itemID" input line and emits (userID, itemID).
    public class MovieMapper1 extends Mapper<LongWritable, Text, Text, Text> {
    
    	@Override
    	public void map(LongWritable ikey, Text ivalue, Context context)
    			throws IOException, InterruptedException {
    		String[] values = ivalue.toString().split(",");
    		// Skip lines that do not have exactly two comma-separated fields.
    		if (values.length != 2) {
    			return;
    		}
    		String userID = values[0];
    		String itemID = values[1];
    		context.write(new Text(userID), new Text(itemID));
    	}
    }
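The mapper's filter depends on how `String.split` treats empty fields, which is easy to misjudge. The following is a minimal plain-Java sketch of the same rule, runnable without Hadoop. The class name `ParseRuleSketch`, the helper `parse`, and the sample lines are hypothetical and are not taken from the original `userItem.txt`.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

class ParseRuleSketch {

    // Mirrors the mapper's rule: split on ',' and keep the line only if
    // exactly two fields come back. Returns null for a skipped line.
    static Map.Entry<String, String> parse(String line) {
        String[] fields = line.split(",");
        if (fields.length != 2) {
            return null;
        }
        return new SimpleEntry<>(fields[0], fields[1]);
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, chosen to exercise the edge cases.
        String[] samples = {"1,101", "1,101,5.0", "1", "1,", ",101"};
        for (String s : samples) {
            Map.Entry<String, String> kv = parse(s);
            System.out.println("\"" + s + "\" -> " + (kv == null ? "skipped" : kv));
        }
    }
}
```

Two cases are worth noting: `"1,"` is skipped because `split` drops trailing empty strings, while `",101"` passes the length check and would be emitted with an empty user ID. If empty IDs matter for the data at hand, an extra check on `values[0]` in the mapper would be one way to reject them.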
    

      

    2.2 Reducer

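    package cf;
    
    import java.io.IOException;
    
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    
    // Joins all item IDs seen for one user into a single comma-separated value.
    public class MovieReduce1 extends Reducer<Text, Text, Text, Text> {
    
    	@Override
    	public void reduce(Text _key, Iterable<Text> values, Context context)
    			throws IOException, InterruptedException {
    		// StringBuilder is enough here: the buffer is local to this call.
    		StringBuilder sb = new StringBuilder();
    		for (Text val : values) {
    			// Every item is followed by a comma, so the value ends with a trailing comma.
    			sb.append(val.toString());
    			sb.append(",");
    		}
    		// The value cannot be the builder itself; convert it to a String and wrap it in Text.
    		context.write(_key, new Text(sb.toString()));
    	}
    
    }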
    

    2.3 Main

    package cf;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    public class UserItemSetMapReduce {
    
    	public static void main(String[] args) throws Exception {
    
    		Configuration conf = new Configuration();
    		// Job.getInstance replaces the deprecated new Job(conf, name) constructor.
    		Job job = Job.getInstance(conf, "CFItemSet");
    		job.setJarByClass(UserItemSetMapReduce.class);
    		job.setMapperClass(MovieMapper1.class);
    		// No combiner is used for this job.
    		// job.setCombinerClass(MovieReduce1.class);
    		job.setReducerClass(MovieReduce1.class);
    		job.setOutputKeyClass(Text.class);
    		job.setOutputValueClass(Text.class);
    		FileInputFormat.addInputPath(job, new Path("hdfs://192.168.58.180:8020/cf/userItem.txt"));
    		// InputPath(job, new Path(otherArgs[0]));
    		// Writing directly to cf fails because cf already exists. I first named the output uIO.ttx,
    		// expecting the results to be written into a text file, but Hadoop treats the output path as a directory.
    		FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.58.180:8020/cf/userItemOut.txt"));
    		System.exit(job.waitForCompletion(true) ? 0 : 1);
    	}
    }
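To see what the three classes produce together without a cluster, here is a rough plain-Java sketch that imitates the map, group-by-key, and reduce steps in memory. It is only an approximation of the job, not Hadoop itself: `LocalUserItemSketch`, `run`, and the sample input are hypothetical, and a `TreeMap` stands in for the shuffle's key sorting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LocalUserItemSketch {

    // "Map" each line to (userID, itemID), group by userID in sorted key
    // order, then "reduce" each group to one tab-separated output line.
    static List<String> run(List<String> inputLines) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String line : inputLines) {
            String[] fields = line.split(",");
            if (fields.length != 2) {
                continue; // same skip rule as MovieMapper1
            }
            groups.computeIfAbsent(fields[0], k -> new ArrayList<>()).add(fields[1]);
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> group : groups.entrySet()) {
            StringBuilder sb = new StringBuilder();
            // Values are joined in insertion order here; a real job gives
            // no guarantee about the order of values within a key.
            for (String item : group.getValue()) {
                sb.append(item).append(","); // same trailing comma as MovieReduce1
            }
            output.add(group.getKey() + "\t" + sb);
        }
        return output;
    }

    public static void main(String[] args) {
        // Hypothetical input, not the original userItem.txt.
        List<String> input = Arrays.asList("1,a", "1,b", "2,a", "3,c", "2,c");
        run(input).forEach(System.out::println);
    }
}
```

Each printed line has the shape of a line in the job's part file: the user ID, a tab, then the user's items with a trailing comma.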
    

      

    3. Results

    3.1 Input

    3.2 Output

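    Inspecting the result shows that the output file uses a tab ('\t') as its default separator between key and value. Compared with the input file, the items in each output line appear in reverse order, like a stack. Is the context really last-in, first-out? MapReduce sorts by key only and does not guarantee the order of values within a key, so this ordering should not be relied on.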
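Since the goal is a set of items per user, a downstream reader can ignore the order in which items appear. The sketch below parses one output line under the assumption that it is well formed, that is, it contains the tab separator and the comma-joined items written by the reducer. `OutputLineSketch`, `userOf`, and `itemsOf` are hypothetical names, and the sample lines are made up.

```java
import java.util.Set;
import java.util.TreeSet;

class OutputLineSketch {

    // Returns the key part of a "user<TAB>item,item," line.
    // Assumes the line contains a tab.
    static String userOf(String line) {
        return line.substring(0, line.indexOf('\t'));
    }

    // Returns the items of the line as a sorted set, so the order in which
    // the reducer happened to receive them no longer matters.
    static Set<String> itemsOf(String line) {
        String csv = line.substring(line.indexOf('\t') + 1);
        Set<String> items = new TreeSet<>();
        for (String item : csv.split(",")) {
            if (!item.isEmpty()) {
                items.add(item);
            }
        }
        return items;
    }

    public static void main(String[] args) {
        // Two hypothetical lines for the same user with items in opposite order.
        String forward = "1\ta,b,c,";
        String reversed = "1\tc,b,a,";
        System.out.println(userOf(forward) + " -> " + itemsOf(forward));
        System.out.println(itemsOf(forward).equals(itemsOf(reversed)));
    }
}
```

Both sample lines yield the same set, which is the property the collaborative filtering step needs.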

    3.3 Log analysis

    Only the main part of the log is shown.

     
     DEBUG - PrivilegedAction as:hxsyl (auth:SIMPLE) from:org.apache.hadoop.mapreduce.Job.getCounters(Job.java:765)
      INFO - Counters: 38
    	File System Counters
    		FILE: Number of bytes read=538
    		FILE: Number of bytes written=509366
    		FILE: Number of read operations=0
    		FILE: Number of large read operations=0
    		FILE: Number of write operations=0
    		HDFS: Number of bytes read=106
    		HDFS: Number of bytes written=37
    		HDFS: Number of read operations=13
    		HDFS: Number of large read operations=0
    		HDFS: Number of write operations=4
    	Map-Reduce Framework
    		Map input records=11
    		Map output records=11
    		Map output bytes=44
    		Map output materialized bytes=72
    		Input split bytes=107
    		Combine input records=0
    		Combine output records=0
    		Reduce input groups=5
    		Reduce shuffle bytes=72
    		Reduce input records=11
    		Reduce output records=5
    		Spilled Records=22
    		Shuffled Maps =1
    		Failed Shuffles=0
    		Merged Map outputs=1
    		GC time elapsed (ms)=3
    		CPU time spent (ms)=0
    		Physical memory (bytes) snapshot=0
    		Virtual memory (bytes) snapshot=0
    		Total committed heap usage (bytes)=462422016
    	Shuffle Errors
    		BAD_ID=0
    		CONNECTION=0
    		IO_ERROR=0
    		WRONG_LENGTH=0
    		WRONG_MAP=0
    		WRONG_REDUCE=0
    	File Input Format Counters 
    		Bytes Read=53
    	File Output Format Counters 
    		Bytes Written=37
     DEBUG - PrivilegedAction as:hxsyl (auth:SIMPLE) from:org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:323)
     DEBUG - stopping client from cache: org.apache.hadoop.ipc.Client@37afeb11
     DEBUG - removing client from cache: org.apache.hadoop.ipc.Client@37afeb11
     DEBUG - stopping actual client because no more references remain: org.apache.hadoop.ipc.Client@37afeb11
     DEBUG - Stopping client
     DEBUG - IPC Client (521081105) connection to /192.168.58.180:8020 from hxsyl: closed
     DEBUG - IPC Client (521081105) connection to /192.168.58.180:8020 from hxsyl: stopped, remaining connections 0
     
    

      

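    Could someone more experienced explain how the job actually ran, based on the log: how the input reached the Map, how many times it executed, and so on?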
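Part of the answer can probably be read from the counters in the log. As far as I understand them, `Map input records=11` means `map()` was called once per input line, 11 times in total; `Shuffled Maps =1` suggests a single map task fed a single reducer; and `Reduce input groups=5` with `Reduce output records=5` means `reduce()` ran once per distinct user and wrote one record each time. The sketch below reproduces that counting in plain Java. `CounterSketch`, `count`, and the 11 sample lines are hypothetical, picked only so the totals match the log.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class CounterSketch {

    // Returns {map() calls, map output records, reduce() calls} for the
    // given input lines, using the same parsing rule as the mapper.
    static int[] count(List<String> lines) {
        int mapCalls = 0;
        int mapOutputs = 0;
        Set<String> keys = new TreeSet<>();
        for (String line : lines) {
            mapCalls++; // one map() call per input record
            String[] fields = line.split(",");
            if (fields.length == 2) {
                mapOutputs++; // one context.write per valid line
                keys.add(fields[0]);
            }
        }
        // reduce() runs once per distinct key and writes one record each time.
        return new int[] {mapCalls, mapOutputs, keys.size()};
    }

    public static void main(String[] args) {
        // Hypothetical input: 11 records spread over 5 users.
        List<String> input = Arrays.asList(
                "1,a", "1,b", "1,c", "2,a", "2,b", "3,a",
                "3,c", "4,b", "4,c", "5,a", "5,c");
        int[] c = count(input);
        System.out.println("Map input records=" + c[0]);
        System.out.println("Map output records=" + c[1]);
        System.out.println("Reduce input groups=" + c[2]);
    }
}
```

With this sample the three printed values are 11, 11, and 5, the same as the corresponding counters in the log.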

  • Original article: https://www.cnblogs.com/hxsyl/p/6068706.html