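The previous article, "Source-level analysis of how a MapReduce MapTask runs" (which I only barely managed to recover, thank goodness), covered the execution flow of MapTask. This section covers the execution flow of ReduceTask. A ReduceTask also comes in four kinds of task; see the corresponding part of the previous section. As for the reduce work proper, a Reduce Task reads one slice of data from each Map Task, sorts it, hands it group by group to the user-written reduce method, and writes the results to HDFS.
MapTask and ReduceTask are both subclasses of Task and correspond to what we usually call map and reduce tasks. As in the previous section, the Child class directly invokes the run method. The code of ReduceTask.run() is as follows: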
    // ReduceTask.run starts out like MapTask: initialize() does the setup, then, depending on the task type, it may call runJobCleanupTask(),
    // runJobSetupTask() or runTaskCleanupTask(). After that the real work begins, in three main steps: Copy, Sort, Reduce.
    @Override
    @SuppressWarnings("unchecked")
    public void run(JobConf job, final TaskUmbilicalProtocol umbilical)
      throws IOException, InterruptedException, ClassNotFoundException {
      this.umbilical = umbilical;
      job.setBoolean("mapred.skip.on", isSkipping());
      /* Add the phases that the reduce process goes through, so that the TaskTracker can be told how the task is doing */
      if (isMapOrReduce()) {
        copyPhase = getProgress().addPhase("copy");
        sortPhase  = getProgress().addPhase("sort");
        reducePhase = getProgress().addPhase("reduce");
      }
      // start thread that will handle communication with parent
      // Set up and start the reporter thread used to talk to the TaskTracker
      TaskReporter reporter = new TaskReporter(getProgress(), umbilical,
          jvmContext);
      reporter.startCommunicationThread();
      // When the job client initializes a job, the new API is the default; see Job.setUseNewAPI()
      boolean useNewApi = job.getUseNewReducer();
      /* Initializes the task, mainly output-related settings such as creating the committer and setting the working directory */
      initialize(job, getJobID(), reporter, useNewApi); // the output directory is handled here
      /* The following if statements act on the task type; these methods all belong to the Task class, so they do not depend on whether the task is a MapTask or a ReduceTask */
      // check if it is a cleanupJobTask
      if (jobCleanup) {
        runJobCleanupTask(umbilical, reporter);
        return;
      }
      if (jobSetup) {
        // mainly creates the FileSystem object for the working directory
        runJobSetupTask(umbilical, reporter);
        return;
      }
      if (taskCleanup) {
        // sets the task's current phase to the cleanup phase and deletes the working directory
        runTaskCleanupTask(umbilical, reporter);
        return;
      }

      // Initialize the codec
      codec = initCodec();

      boolean isLocal = "local".equals(job.get("mapred.job.tracker", "local")); // is this a single-machine (local) Hadoop?
      if (!isLocal) {
        // 1. Copy: receive the map output files from the servers that ran the map tasks. Copying is handled by the ReduceTask.ReduceCopier class.
        // The ReduceCopier object copies the map output to the machine where the reduce runs
        reduceCopier = new ReduceCopier(umbilical, job, reporter);
        if (!reduceCopier.fetchOutputs()) { // fetchOutputs copies the output of each map
          if(reduceCopier.mergeThrowable instanceof FSError) {
            throw (FSError)reduceCopier.mergeThrowable;
          }
          throw new IOException("Task: " + getTaskID() +
              " - The reduce copier failed", reduceCopier.mergeThrowable);
        }
      }
      copyPhase.complete();                         // copy is already complete
      setPhase(TaskStatus.Phase.SORT);
      statusUpdate(umbilical);

      final FileSystem rfs = FileSystem.getLocal(job).getRaw();
      // 2. Sort (really a merge): a continuation of the sorting done earlier. It runs after all files have been copied.
      // The Merger utility class merges all the files. After this step there is one new file that combines all the required map output files,
      // and the map output files gathered from the other servers have all been deleted.

      // Whether Hadoop is distributed decides which sort path is used
      RawKeyValueIterator rIter = isLocal
        ? Merger.merge(job, rfs, job.getMapOutputKeyClass(),
            job.getMapOutputValueClass(), codec, getMapFiles(rfs, true),
            !conf.getKeepFailedTaskFiles(), job.getInt("io.sort.factor", 100),
            new Path(getTaskID().toString()), job.getOutputKeyComparator(),
            reporter, spilledRecordsCounter, null)
        : reduceCopier.createKVIterator(job, rfs, reporter);

      // free up the data structures
      mapOutputFilesOnDisk.clear();

      sortPhase.complete();                         // sort is complete
      setPhase(TaskStatus.Phase.REDUCE);
      statusUpdate(umbilical);
      // 3. Reduce. (1) The last phase of the reduce task. It prepares the map keyClass ("mapred.output.key.class" or "mapred.mapoutput.key.class"),
      // the valueClass ("mapred.mapoutput.value.class" or "mapred.output.value.class")
      // and the Comparator ("mapred.output.value.groupfn.class" or "mapred.output.key.comparator.class")
      Class keyClass = job.getMapOutputKeyClass();
      Class valueClass = job.getMapOutputValueClass();
      RawComparator comparator = job.getOutputValueGroupingComparator();
      // (2) The useNewApi flag decides between runNewReducer and runOldReducer. We analyze runNewReducer.
      if (useNewApi) {
        // (3) runNewReducer:
        // 0. write some information to the reporter
        // 1. obtain a TaskAttemptContext; use it to create the reducer, the output, and the tracking RecordWriter that counts output records, and finally the Context that collects the reduce results
        // 2. reducer.run(reducerContext) starts the reduce
        runNewReducer(job, umbilical, reporter, rIter, comparator,
                      keyClass, valueClass);
      } else {
        runOldReducer(job, umbilical, reporter, rIter, comparator,
                      keyClass, valueClass);
      }
      done(umbilical, reporter);
    }
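(1) Reduce is divided into three phases: copy remotely copies the map output data, sort sorts all of the data, and reduce does the aggregation, i.e. the reducer we write ourselves. A Progress is set up for each of the three phases and is used to report status to the TaskTracker.
(2) Lines 15-40 of the code above are essentially the same as the corresponding part of "Source-level analysis of how a MapReduce MapTask runs"; refer to that article.
(3) codec = initCodec() checks whether the map output is compressed. If it is, the compression codec instance is returned; otherwise null is returned. Here we discuss the uncompressed case.
(4) We discuss a fully distributed Hadoop, i.e. isLocal==false. A ReduceCopier object, reduceCopier, is constructed, and reduceCopier.fetchOutputs() is called to copy the output of every Mapper to the local machine.
(5) The copy phase is then complete; the next phase is set to sort and the status information is updated.
(6) The KV iterator is chosen according to isLocal. In fully distributed mode, reduceCopier.createKVIterator(job, rfs, reporter) is used as the KV iterator.
(7) The sort phase is complete; the next phase is set to reduce and the status information is updated.
(8) Some configuration is then read, and the processing path is chosen according to whether the new API is in use. Here it is the new API, so runNewReducer(job, umbilical, reporter, rIter, comparator, keyClass, valueClass) is called to run the reducer.
(9) done(umbilical, reporter) does the cleanup at the end of the task: it updates the counters with updateCounters(); if the task needs a commit, it sets the Task state to COMMIT_PENDING, uses TaskUmbilicalProtocol to report that the Task is finished and waiting to commit, and then calls commit; it sets the task-done flag; it stops the Reporter communication thread; it sends the last statistics report (via sendLastUpdate); and it reports the final state through TaskUmbilicalProtocol (via sendDone).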
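Some people divide a Reduce Task into 5 phases. 1. Shuffle phase, also called the copy phase: one slice of data is copied remotely from each MapTask; if its size exceeds a threshold it is written to disk, otherwise it is kept in memory. 2. Merge phase: while data is being copied, the Reduce Task runs two background threads that merge the files in memory and on disk, so that neither memory usage nor the number of disk files grows too large. 3. Sort phase: the input to the user-written reduce method is grouped by key, so the copied data must be sorted; because the output of each Map Task is already sorted, a merge sort is used. 4. Reduce phase: each group of data is handed in turn to the user-written reduce method. 5. Write phase: the results are written to HDFS.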
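The sort phase relies on every map output already being sorted, so the copied pieces only need to be merged. As a rough illustration of that idea (this is not Hadoop's Merger class; the class name SortedRunMerger and the use of String keys are assumptions made for this sketch), a k-way merge over sorted runs can be written with a priority queue:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of a k-way merge; not Hadoop's Merger.
class SortedRunMerger {
    // One cursor per sorted run: the run it reads and the current position in it.
    private static final class Cursor {
        final List<String> run;
        int pos;
        Cursor(List<String> run) { this.run = run; }
        String current() { return run.get(pos); }
    }

    // Merges runs that are each already sorted into one sorted list.
    static List<String> merge(List<List<String>> runs) {
        // Min-heap ordered by the key each cursor currently points at.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> a.current().compareTo(b.current()));
        for (List<String> run : runs) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            merged.add(smallest.current());
            smallest.pos++;
            if (smallest.pos < smallest.run.size()) {
                heap.add(smallest); // re-insert with its next key
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> runs = Arrays.asList(
            Arrays.asList("a", "c", "e"),
            Arrays.asList("b", "c"),
            Arrays.asList("d"));
        System.out.println(merge(runs)); // prints [a, b, c, c, d, e]
    }
}
```

Each element is touched once and every heap operation costs O(log k) for k runs, which is why merging sorted map outputs is cheaper than sorting everything again from scratch.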
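The 5 phases above are a fairly fine-grained split. In the code there are 3 phases: copy, sort and reduce. When we run an MR program in Eclipse, the reduce percentage shown on the console is made up of these 3 phases, each accounting for 33.3%.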
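To make the 33.3% figure concrete, here is a minimal sketch of equally weighted phases. PhasedProgress is a made-up class for illustration, not Hadoop's Progress; the only assumption is that each phase contributes the same share to the overall figure:

```java
// Hypothetical illustration of equally weighted phases; not Hadoop's Progress class.
class PhasedProgress {
    private final String[] names;
    private final float[] done; // per-phase completion in [0, 1]

    PhasedProgress(String... names) {
        this.names = names;
        this.done = new float[names.length];
    }

    // Records how far a single phase has progressed (clamped to [0, 1]).
    void set(String phase, float fraction) {
        for (int i = 0; i < names.length; i++) {
            if (names[i].equals(phase)) {
                done[i] = Math.max(0f, Math.min(1f, fraction));
                return;
            }
        }
        throw new IllegalArgumentException("unknown phase: " + phase);
    }

    // Every phase carries the same weight, so the overall value is the plain average.
    float overall() {
        float sum = 0f;
        for (float d : done) {
            sum += d;
        }
        return sum / names.length;
    }

    public static void main(String[] args) {
        PhasedProgress reduce = new PhasedProgress("copy", "sort", "reduce");
        reduce.set("copy", 1.0f);   // copy finished: one third of the total
        reduce.set("sort", 1.0f);   // sort finished: two thirds
        reduce.set("reduce", 0.5f); // half way through the user's reduce calls
        System.out.println(reduce.overall() * 100 + "%");
    }
}
```

With copy finished and nothing else started, overall() is one third, which matches the console showing about 33% once the shuffle is done.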
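Next we focus on two things: the runNewReducer method and the ReduceCopier class. The latter has more than 2000 lines of code and accounts for most of the ReduceTask class.
A. Let's look at runNewReducer first, since it is easier than ReduceCopier. The code is as follows: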
    @SuppressWarnings("unchecked")
    private <INKEY,INVALUE,OUTKEY,OUTVALUE>
    void runNewReducer(JobConf job,
                       final TaskUmbilicalProtocol umbilical,
                       final TaskReporter reporter,
                       RawKeyValueIterator rIter,
                       RawComparator<INKEY> comparator,
                       Class<INKEY> keyClass,
                       Class<INVALUE> valueClass
                       ) throws IOException,InterruptedException,
                                ClassNotFoundException {
      // wrap value iterator to report progress.
      final RawKeyValueIterator rawIter = rIter;
      rIter = new RawKeyValueIterator() {
        public void close() throws IOException {
          rawIter.close();
        }
        public DataInputBuffer getKey() throws IOException {
          return rawIter.getKey();
        }
        public Progress getProgress() {
          return rawIter.getProgress();
        }
        public DataInputBuffer getValue() throws IOException {
          return rawIter.getValue();
        }
        public boolean next() throws IOException {
          boolean ret = rawIter.next();
          reducePhase.set(rawIter.getProgress().get());
          reporter.progress();
          return ret;
        }
      };
      // make a task context so we can get the classes
      /* TaskAttemptContext extends JobContext and adds some task-related information. Through the taskContext object
         one can obtain many classes related to task execution, such as the user-defined Mapper class, the InputFormat class and so on */
      org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
        new org.apache.hadoop.mapreduce.TaskAttemptContext(job, getTaskID());
      // make a reducer
      // create an instance of the user-defined Reducer class
      org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE> reducer =
        (org.apache.hadoop.mapreduce.Reducer<INKEY,INVALUE,OUTKEY,OUTVALUE>)
          ReflectionUtils.newInstance(taskContext.getReducerClass(), job);

      org.apache.hadoop.mapreduce.RecordWriter<OUTKEY,OUTVALUE> trackedRW =
        new NewTrackingRecordWriter<OUTKEY, OUTVALUE>(reduceOutputCounter,
          job, reporter, taskContext);
      job.setBoolean("mapred.skip.on", isSkipping());
      org.apache.hadoop.mapreduce.Reducer.Context
           reducerContext = createReduceContext(reducer, job, getTaskID(),
                                                 rIter, reduceInputKeyCounter,
                                                 reduceInputValueCounter,
                                                 trackedRW, committer,
                                                 reporter, comparator, keyClass,
                                                 valueClass);
      reducer.run(reducerContext);
      trackedRW.close(reducerContext);
    }
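(1) The parameter RawKeyValueIterator rIter is actually an org.apache.hadoop.mapred.Merger.MergeQueue. It is first assigned to a new RawKeyValueIterator variable, rawIter, and then rIter is replaced by a new RawKeyValueIterator implementation that delegates to rawIter and can track and report its progress.
(2) The task context object is constructed and an instance of the user's own Reducer class is obtained; then a NewTrackingRecordWriter object, trackedRW, is created to serve as the output.
(3) rIter, trackedRW and the other pieces of information are passed to org.apache.hadoop.mapreduce.Reducer.Context, which builds a context object that manages reading and writing. Its parent class ReduceContext implements the input side, i.e. the operations on the iterator; ReduceContext's parent class TaskInputOutputContext implements the output methods, and its write method directly calls trackedRW.write(key,value).
(4) reducer.run(reducerContext) executes the reducer's run method. This run method is essentially the same as the one in the previous section; refer to it.
(5) The output is closed with trackedRW.close(reducerContext).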
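The wrapping in step (1) is a plain decorator: every call is forwarded to the original iterator, and one extra action runs each time the iterator advances. A minimal sketch of the same pattern on java.util.Iterator follows; ReportingIterator and its onAdvance callback are names invented for this example, and the callback stands in for the reducePhase.set(...) and reporter.progress() calls in the real code:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical decorator: forwards to a delegate and reports every advance.
class ReportingIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private final Runnable onAdvance; // stand-in for the progress report

    ReportingIterator(Iterator<T> delegate, Runnable onAdvance) {
        this.delegate = delegate;
        this.onAdvance = onAdvance;
    }

    @Override
    public boolean hasNext() {
        return delegate.hasNext(); // pure forwarding, no side effect
    }

    @Override
    public T next() {
        T item = delegate.next();
        onAdvance.run(); // report only after the delegate actually advanced
        return item;
    }

    public static void main(String[] args) {
        AtomicInteger ticks = new AtomicInteger();
        Iterator<String> it = new ReportingIterator<>(
            Arrays.asList("k1", "k2", "k3").iterator(), ticks::incrementAndGet);
        while (it.hasNext()) {
            it.next();
        }
        System.out.println("progress reports: " + ticks.get()); // prints progress reports: 3
    }
}
```

Because the caller only ever sees the wrapper, the reduce loop needs no knowledge of progress reporting, which is the point of reassigning rIter in runNewReducer.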
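1. NewTrackingRecordWriter, the class that manages the output, deserves a word here. It is a subclass of mapreduce.RecordWriter and is quite similar to NewDirectOutputCollector from the previous section, so it is not covered again.
2. The input data iterator rIter does need some explanation. Iterating over the different values of the same key is implemented in ReduceContext. Before getting into that, let's look at the code of Reducer.run():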
    public void run(Context context) throws IOException, InterruptedException {
      setup(context);
      while (context.nextKey()) {
        reduce(context.getCurrentKey(), context.getValues(), context);
      }
      cleanup(context);
    }
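We only discuss the while loop; the other parts were explained in the previous section and are basically the same. The loop condition is that ReduceContext.nextKey() returns true. This method is implemented in ReduceContext, and its purpose is to move on to the next unique key (that is, to guarantee the key is a new one). The input to the reduce method is grouped, so each call handles one key together with all the values for that key; and because the output of every Map Task has already been copied over and sorted, KV pairs with the same key sit next to each other. Here is the code of nextKey():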
    /** Start processing next unique key. */
    public boolean nextKey() throws IOException,InterruptedException {
      // While there is more data and the next KV has the same key as the current one, keep advancing until the key changes.
      // Normally this loop does not run, because the value iterator has already advanced until nextKeyIsSame==false.
      while (hasMore && nextKeyIsSame) {
        nextKeyValue();
      }
      if (hasMore) {                      // there is more data
        if (inputKeyCounter != null) {
          inputKeyCounter.increment(1);   // statistics
        }
        return nextKeyValue();            // advance to the next KV
      } else {
        return false;
      }
    }
The method above calls another method, nextKeyValue(), which tries to fetch the next key and value. It returns false when there is no more data and true otherwise. The code is:
    public boolean nextKeyValue() throws IOException, InterruptedException {
      if (!hasMore) {
        key = null;
        value = null;
        return false;
      }
      // Is this the first value of a key? For the first value, nextKeyIsSame (left over from the previous step) is false,
      // so firstValue==true; for later values nextKeyIsSame is true, so firstValue==false.
      firstValue = !nextKeyIsSame;
      DataInputBuffer next = input.getKey();
      currentRawKey.set(next.getData(), next.getPosition(),
                        next.getLength() - next.getPosition());
      buffer.reset(currentRawKey.getBytes(), 0, currentRawKey.getLength());
      key = keyDeserializer.deserialize(key);        // deserialize to get the key
      next = input.getValue();
      buffer.reset(next.getData(), next.getPosition(), next.getLength());
      value = valueDeserializer.deserialize(value);  // deserialize to get the value
      hasMore = input.next();                        // is there more data?
      if (hasMore) {
        next = input.getKey();
        nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
                                           currentRawKey.getLength(),
                                           next.getData(),
                                           next.getPosition(),
                                           next.getLength() - next.getPosition()
                                           ) == 0;   // does the next KV have the same key as the current one?
      } else {                                       // no more data
        nextKeyIsSame = false;
      }
      inputValueCounter.increment(1);
      return true;
    }
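Two fields matter here: firstValue says whether the current value is the first value of the current key, and nextKeyIsSame says whether the next key equals the current key. Both play an important role when the values are iterated. This method reads the key and the value, which can then be retrieved with getCurrentKey() and getCurrentValue(). It also reads the next key and compares it with the current key: if they are equal, nextKeyIsSame=true, otherwise nextKeyIsSame=false.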
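Note that the comparison behind nextKeyIsSame works on the serialized bytes of the two keys, not on deserialized objects. The sketch below shows one way such a raw comparison can look: a lexicographic comparison of two byte ranges with bytes treated as unsigned. RawKeyCompare is a name made up for this example, and the byte-wise ordering is an assumption; the comparator actually used in a job depends on the key type and the job configuration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical raw comparison of serialized keys; the real comparator depends on the key type.
class RawKeyCompare {
    // Compares b1[s1 .. s1+l1) with b2[s2 .. s2+l2) lexicographically, bytes as unsigned values.
    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int common = Math.min(l1, l2);
        for (int i = 0; i < common; i++) {
            int a = b1[s1 + i] & 0xff;
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return l1 - l2; // equal prefix: the shorter range sorts first
    }

    public static void main(String[] args) {
        // The "current key" lives inside a larger buffer, as it does in a DataInputBuffer.
        byte[] buffer = "xxappleyy".getBytes(StandardCharsets.UTF_8);
        byte[] next = "apple".getBytes(StandardCharsets.UTF_8);
        boolean nextKeyIsSame = compareBytes(buffer, 2, 5, next, 0, next.length) == 0;
        System.out.println(nextKeyIsSame); // prints true
    }
}
```

Passing an offset and a length rather than a whole array is what lets the caller compare a key in place inside a shared buffer, which is why nextKeyValue() hands getPosition() and a length to the comparator.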
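Now back to the run() method. The loop condition is understood, so what about the loop body? Remember the user's own reduce method: it takes a key and an iterator over the values for that key. Here they correspond to context.getCurrentKey() and context.getValues(). Let's look closely at the latter. context.getValues() is also in the ReduceContext class; it returns an iterable object, ValueIterable, which wraps the iterator ValueIterator. This iterator is what reads the values one by one. Its full code is: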
    protected class ValueIterator implements Iterator<VALUEIN> {

      @Override
      public boolean hasNext() {
        return firstValue || nextKeyIsSame;
      }

      @Override
      public VALUEIN next() {
        // if this is the first record, we don't need to advance
        if (firstValue) {
          firstValue = false;
          return value;
        }
        // if this isn't the first record and the next key is different, they
        // can't advance it here.
        if (!nextKeyIsSame) {
          throw new NoSuchElementException("iterate past last value");
        }
        // otherwise, go to the next key/value pair
        try {    // firstValue==false and nextKeyIsSame == true
          nextKeyValue();
          return value;
        } catch (IOException ie) {
          throw new RuntimeException("next value iterator failed", ie);
        } catch (InterruptedException ie) {
          // this is bad, but we can't modify the exception list of java.util
          throw new RuntimeException("next value iterator interrupted", ie);
        }
      }

      @Override
      public void remove() {
        throw new UnsupportedOperationException("remove not implemented");
      }

    }
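hasNext() decides whether there is another value. It depends on firstValue and nextKeyIsSame described above: if either is true, there is a next value. The reason follows from the two definitions: firstValue==true means the first value of the current key has been read but not yet handed out, and nextKeyIsSame==true means the next KV pair still belongs to the current key.
next() is where a value is actually read, and there are three cases. 1. If firstValue==true, the current value is returned directly; no problem there. 2. If firstValue==false and nextKeyIsSame==false, the caller is asking for a value past the end of the current key's group, which makes no sense, so an exception is thrown. 3. If firstValue==false and nextKeyIsSame==true, the next KV pair has the same key as the current one and this is not the first value (it may be the Nth), so nextKeyValue() is called to fetch the next value, which is then returned. This is the mechanism by which reduce keeps fetching all the values of one key.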
That clears up the input data iterator from item 2 above.
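To see how nextKey(), nextKeyValue() and the value iterator cooperate, the following stand-alone sketch replays the same state machine over an in-memory list of sorted pairs. GroupedReader and groupAll are names invented for this example; keys are compared with String.equals instead of a RawComparator, and the counters and deserializers of the real ReduceContext are left out:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical, simplified replay of the firstValue / nextKeyIsSame logic.
class GroupedReader {
    private final List<String[]> pairs; // sorted by key; each element is {key, value}
    private int pos = 0;
    private boolean hasMore;
    private boolean nextKeyIsSame = false;
    private boolean firstValue = false;
    private String key;
    private String value;

    GroupedReader(List<String[]> sortedPairs) {
        this.pairs = sortedPairs;
        this.hasMore = !sortedPairs.isEmpty();
    }

    // Moves to the next distinct key, skipping values the caller did not consume.
    boolean nextKey() {
        while (hasMore && nextKeyIsSame) {
            nextKeyValue();
        }
        return hasMore && nextKeyValue();
    }

    private boolean nextKeyValue() {
        if (!hasMore) {
            key = null;
            value = null;
            return false;
        }
        firstValue = !nextKeyIsSame; // true only when a new group starts
        key = pairs.get(pos)[0];
        value = pairs.get(pos)[1];
        pos++;
        hasMore = pos < pairs.size();
        nextKeyIsSame = hasMore && pairs.get(pos)[0].equals(key); // peek at the next key
        return true;
    }

    String currentKey() {
        return key;
    }

    Iterator<String> values() {
        return new Iterator<String>() {
            @Override
            public boolean hasNext() {
                return firstValue || nextKeyIsSame;
            }

            @Override
            public String next() {
                if (firstValue) { // the first value was already read by nextKey()
                    firstValue = false;
                    return value;
                }
                if (!nextKeyIsSame) {
                    throw new NoSuchElementException("iterate past last value");
                }
                nextKeyValue();
                return value;
            }
        };
    }

    // Drives the reader the same way Reducer.run() drives the context.
    static Map<String, List<String>> groupAll(List<String[]> sortedPairs) {
        GroupedReader reader = new GroupedReader(sortedPairs);
        Map<String, List<String>> groups = new LinkedHashMap<>();
        while (reader.nextKey()) {
            List<String> vs = new ArrayList<>();
            Iterator<String> it = reader.values();
            while (it.hasNext()) {
                vs.add(it.next());
            }
            groups.put(reader.currentKey(), vs);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
            new String[] {"a", "1"}, new String[] {"a", "2"}, new String[] {"b", "3"});
        System.out.println(groupAll(pairs)); // prints {a=[1, 2], b=[3]}
    }
}
```

The reader never buffers a whole group: it holds one pair plus a one-step lookahead, which is why a reduce call can stream through keys with very many values.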
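B. Next comes the ReduceCopier class. It carries a heavy workload and is also fairly complex.
The key method is ReduceCopier.fetchOutputs(), which copies the output of each map. It is long, close to 400 lines. The code follows, with some comments inside: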
    // Fetch the map results through ReduceCopier's fetchOutputs() method
    public boolean fetchOutputs() throws IOException {
      int totalFailures = 0;
      int numInFlight = 0, numCopied = 0;
      DecimalFormat mbpsFormat = new DecimalFormat("0.00");
      final Progress copyPhase =
        reduceTask.getProgress().phase();
      // (4) Concurrent merging: an in-memory merger thread, InMemFSMergeThread, and a file merger thread, LocalFSMerger, work at the same time.
      // They merge-sort the downloaded files (some may be in memory; simply called "files" here), which saves time and reduces the number of input files,
      // lightening the later sort work. InMemFSMergeThread's run loop calls doInMemMerge, which uses the Merger utility class to merge;
      // if a combine is needed, combinerRunner.combine is called.
      LocalFSMerger localFSMergerThread = null;
      InMemFSMergeThread inMemFSMergeThread = null;
      // (1) Asking for work: uses the GetMapEventsThread thread.
      // Its run method keeps calling getMapCompletionEvents,
      // which in turn makes an RPC call to getMapCompletionEvents of the TaskUmbilicalProtocol protocol.
      // That method uses the jobID to ask the parent TaskTracker for the completion status of the job's map tasks
      // (the TaskTracker has to ask the JobTracker and then relay the answer). It returns an array TaskCompletionEvent events[].
      // A TaskCompletionEvent contains information such as the taskid and the IP address.
      GetMapEventsThread getMapEventsThread = null;

      for (int i = 0; i < numMaps; i++) {
        copyPhase.addPhase();       // add sub-phase per file
      }

      copiers = new ArrayList<MapOutputCopier>(numCopiers);

      // start all the copying threads
      for (int i=0; i < numCopiers; i++) {
        // (2) Once the information about the servers that ran the map tasks is available, a MapOutputCopier thread does the actual copying.
        // It runs in its own thread and copies the files from one map task server. MapOutputCopier's run loop calls
        // copyOutput, which calls getMapOutput and copies remotely over HTTP.
        MapOutputCopier copier = new MapOutputCopier(conf, reporter,
            reduceTask.getJobTokenSecret());
        copiers.add(copier);
        copier.start();
      }

      //start the on-disk-merge thread
      localFSMergerThread = new LocalFSMerger((LocalFileSystem)localFileSys);
      //start the in memory merger thread
      inMemFSMergeThread = new InMemFSMergeThread();
      localFSMergerThread.start();
      inMemFSMergeThread.start();

      // start the map events thread
      getMapEventsThread = new GetMapEventsThread();
      getMapEventsThread.start();

      // start the clock for bandwidth measurement
      long startTime = System.currentTimeMillis();
      long currentTime = startTime;
      long lastProgressTime = startTime;
      long lastOutputTime = 0;

      // loop until we get all required outputs
      while (copiedMapOutputs.size() < numMaps && mergeThrowable == null) {

        currentTime = System.currentTimeMillis();
        boolean logNow = false;
        if (currentTime - lastOutputTime > MIN_LOG_TIME) {
          lastOutputTime = currentTime;
          logNow = true;
        }
        if (logNow) {
          LOG.info(reduceTask.getTaskID() + " Need another "
                   + (numMaps - copiedMapOutputs.size()) + " map output(s) "
                   + "where " + numInFlight + " is already in progress");
        }

        // Put the hash entries for the failed fetches.
        Iterator<MapOutputLocation> locItr = retryFetches.iterator();

        while (locItr.hasNext()) {
          MapOutputLocation loc = locItr.next();
          List<MapOutputLocation> locList =
            mapLocations.get(loc.getHost());

          // Check if the list exists. Map output location mapping is cleared
          // once the jobtracker restarts and is rebuilt from scratch.
          // Note that map-output-location mapping will be recreated and hence
          // we continue with the hope that we might find some locations
          // from the rebuild map.
          if (locList != null) {
            // Add to the beginning of the list so that this map is
            //tried again before the others and we can hasten the
            //re-execution of this map should there be a problem
            locList.add(0, loc);
          }
        }

        if (retryFetches.size() > 0) {
          LOG.info(reduceTask.getTaskID() + ": " +
                "Got " + retryFetches.size() +
                " map-outputs from previous failures");
        }
        // clear the "failed" fetches hashmap
        retryFetches.clear();

        // now walk through the cache and schedule what we can
        int numScheduled = 0;
        int numDups = 0;

        synchronized (scheduledCopies) {

          // Randomize the map output locations to prevent
          // all reduce-tasks swamping the same tasktracker
          List<String> hostList = new ArrayList<String>();
          hostList.addAll(mapLocations.keySet());

          Collections.shuffle(hostList, this.random); // shuffle to reduce hot spots

          Iterator<String> hostsItr = hostList.iterator();

          while (hostsItr.hasNext()) {

            String host = hostsItr.next();

            List<MapOutputLocation> knownOutputsByLoc =
              mapLocations.get(host);

            // Check if the list exists. Map output location mapping is
            // cleared once the jobtracker restarts and is rebuilt from
            // scratch.
            // Note that map-output-location mapping will be recreated and
            // hence we continue with the hope that we might find some
            // locations from the rebuild map and add then for fetching.
            if (knownOutputsByLoc == null || knownOutputsByLoc.size() == 0) {
              continue;
            }

            //Identify duplicate hosts here
            if (uniqueHosts.contains(host)) {
               numDups += knownOutputsByLoc.size();
               continue;
            }

            Long penaltyEnd = penaltyBox.get(host);
            boolean penalized = false;

            if (penaltyEnd != null) {
              if (currentTime < penaltyEnd.longValue()) {
                penalized = true;
              } else {
                penaltyBox.remove(host);
              }
            }

            if (penalized)
              continue;

            synchronized (knownOutputsByLoc) {

              locItr = knownOutputsByLoc.iterator();

              while (locItr.hasNext()) {

                MapOutputLocation loc = locItr.next();

                // Do not schedule fetches from OBSOLETE maps
                if (obsoleteMapIds.contains(loc.getTaskAttemptId())) {
                  locItr.remove();
                  continue;
                }

                uniqueHosts.add(host);
                scheduledCopies.add(loc);
                locItr.remove();  // remove from knownOutputs
                numInFlight++; numScheduled++;

                break; //we have a map from this host
              }
            }
          }
          scheduledCopies.notifyAll();
        }

        if (numScheduled > 0 || logNow) {
          LOG.info(reduceTask.getTaskID() + " Scheduled " + numScheduled +
                 " outputs (" + penaltyBox.size() +
                 " slow hosts and" + numDups + " dup hosts)");
        }

        if (penaltyBox.size() > 0 && logNow) {
          LOG.info("Penalized(slow) Hosts: ");
          for (String host : penaltyBox.keySet()) {
            LOG.info(host + " Will be considered after: " +
                ((penaltyBox.get(host) - currentTime)/1000) + " seconds.");
          }
        }

        // if we have no copies in flight and we can't schedule anything
        // new, just wait for a bit
        try {
          if (numInFlight == 0 && numScheduled == 0) {
            // we should indicate progress as we don't want TT to think
            // we're stuck and kill us
            reporter.progress();
            Thread.sleep(5000);
          }
        } catch (InterruptedException e) { } // IGNORE

        while (numInFlight > 0 && mergeThrowable == null) {
          LOG.debug(reduceTask.getTaskID() + " numInFlight = " +
                    numInFlight);
          //the call to getCopyResult will either
          //1) return immediately with a null or a valid CopyResult object,
          //                 or
          //2) if the numInFlight is above maxInFlight, return with a
          //   CopyResult object after getting a notification from a
          //   fetcher thread,
          //So, when getCopyResult returns null, we can be sure that
          //we aren't busy enough and we should go and get more mapcompletion
          //events from the tasktracker
          CopyResult cr = getCopyResult(numInFlight);

          if (cr == null) {
            break;
          }

          if (cr.getSuccess()) {  // a successful copy
            numCopied++;
            lastProgressTime = System.currentTimeMillis();
            reduceShuffleBytes.increment(cr.getSize());

            long secsSinceStart =
              (System.currentTimeMillis()-startTime)/1000+1;
            float mbs = ((float)reduceShuffleBytes.getCounter())/(1024*1024);
            float transferRate = mbs/secsSinceStart;

            copyPhase.startNextPhase();
            copyPhase.setStatus("copy (" + numCopied + " of " + numMaps
                                + " at " +
                                mbpsFormat.format(transferRate) +  " MB/s)");

            // Note successful fetch for this mapId to invalidate
            // (possibly) old fetch-failures
            fetchFailedMaps.remove(cr.getLocation().getTaskId());
          } else if (cr.isObsolete()) {
            //ignore
            LOG.info(reduceTask.getTaskID() +
                     " Ignoring obsolete copy result for Map Task: " +
                     cr.getLocation().getTaskAttemptId() + " from host: " +
                     cr.getHost());
          } else {
            retryFetches.add(cr.getLocation());

            // note the failed-fetch
            TaskAttemptID mapTaskId = cr.getLocation().getTaskAttemptId();
            TaskID mapId = cr.getLocation().getTaskId();

            totalFailures++;
            Integer noFailedFetches =
              mapTaskToFailedFetchesMap.get(mapTaskId);
            noFailedFetches =
              (noFailedFetches == null) ? 1 : (noFailedFetches + 1);
            mapTaskToFailedFetchesMap.put(mapTaskId, noFailedFetches);
            LOG.info("Task " + getTaskID() + ": Failed fetch #" +
                     noFailedFetches + " from " + mapTaskId);

            if (noFailedFetches >= abortFailureLimit) {
              LOG.fatal(noFailedFetches + " failures downloading "
                        + getTaskID() + ".");
              umbilical.shuffleError(getTaskID(),
                  "Exceeded the abort failure limit;"
                  + " bailing-out.", jvmContext);
            }

            checkAndInformJobTracker(noFailedFetches, mapTaskId,
                cr.getError().equals(CopyOutputErrorType.READ_ERROR));

            // note unique failed-fetch maps
            if (noFailedFetches == maxFetchFailuresBeforeReporting) {
              fetchFailedMaps.add(mapId);

              // did we have too many unique failed-fetch maps?
              // and did we fail on too many fetch attempts?
              // and did we progress enough
              //     or did we wait for too long without any progress?

              // check if the reducer is healthy
              boolean reducerHealthy =
                  (((float)totalFailures / (totalFailures + numCopied))
                   < MAX_ALLOWED_FAILED_FETCH_ATTEMPT_PERCENT);

              // check if the reducer has progressed enough
              boolean reducerProgressedEnough =
                  (((float)numCopied / numMaps)
                   >= MIN_REQUIRED_PROGRESS_PERCENT);

              // check if the reducer is stalled for a long time
              // duration for which the reducer is stalled
              int stallDuration =
                  (int)(System.currentTimeMillis() - lastProgressTime);
              // duration for which the reducer ran with progress
              int shuffleProgressDuration =
                  (int)(lastProgressTime - startTime);
              // min time the reducer should run without getting killed
              int minShuffleRunDuration =
                  (shuffleProgressDuration > maxMapRuntime)
                  ? shuffleProgressDuration
                  : maxMapRuntime;
              boolean reducerStalled =
                  (((float)stallDuration / minShuffleRunDuration)
                   >= MAX_ALLOWED_STALL_TIME_PERCENT);

              // kill if not healthy and has insufficient progress
              if ((fetchFailedMaps.size() >= maxFailedUniqueFetches ||
                   fetchFailedMaps.size() == (numMaps - copiedMapOutputs.size()))
                  && !reducerHealthy
                  && (!reducerProgressedEnough || reducerStalled)) {
                LOG.fatal("Shuffle failed with too many fetch failures " +
                          "and insufficient progress!" +
                          "Killing task " + getTaskID() + ".");
                umbilical.shuffleError(getTaskID(),
                    "Exceeded MAX_FAILED_UNIQUE_FETCHES;"
                    + " bailing-out.", jvmContext);
              }

            }

            currentTime = System.currentTimeMillis();
            long currentBackOff = (long)(INITIAL_PENALTY *
                Math.pow(PENALTY_GROWTH_RATE, noFailedFetches));

            penaltyBox.put(cr.getHost(), currentTime + currentBackOff);
            LOG.warn(reduceTask.getTaskID() + " adding host " +
                     cr.getHost() + " to penalty box, next contact in " +
                     (currentBackOff/1000) + " seconds");
          }
          uniqueHosts.remove(cr.getHost());
          numInFlight--;
        }
      }

      // all done, inform the copiers to exit
      exitGetMapEvents= true;
      try {
        getMapEventsThread.join();
        LOG.info("getMapsEventsThread joined.");
      } catch (InterruptedException ie) {
        LOG.info("getMapsEventsThread threw an exception: " +
            StringUtils.stringifyException(ie));
      }

      synchronized (copiers) {
        synchronized (scheduledCopies) {
          for (MapOutputCopier copier : copiers) {
            copier.interrupt();
          }
          copiers.clear();
        }
      }

      // copiers are done, exit and notify the waiting merge threads
      synchronized (mapOutputFilesOnDisk) {
        exitLocalFSMerge = true;
        mapOutputFilesOnDisk.notify();
      }

      ramManager.close();

      //Do a merge of in-memory files (if there are any)
      if (mergeThrowable == null) {
        try {
          // Wait for the on-disk merge to complete
          localFSMergerThread.join();
          LOG.info("Interleaved on-disk merge complete: " +
                   mapOutputFilesOnDisk.size() + " files left.");

          //wait for an ongoing merge (if it is in flight) to complete
          inMemFSMergeThread.join();
          LOG.info("In-memory merge complete: " +
                   mapOutputsFilesInMemory.size() + " files left.");
        } catch (InterruptedException ie) {
          LOG.warn(reduceTask.getTaskID() +
                   " Final merge of the inmemory files threw an exception: " +
                   StringUtils.stringifyException(ie));
          // check if the last merge generated an error
          if (mergeThrowable != null) {
            mergeThrowable = ie;
          }
          return false;
        }
      }
      return mergeThrowable == null && copiedMapOutputs.size() == numMaps;
    }
This method constructs several thread objects: 1 LocalFSMerger thread, 1 InMemFSMergeThread thread, 1 GetMapEventsThread thread, and a number of MapOutputCopier threads (set by "mapred.reduce.parallel.copies", 5 by default).
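(1) First, several MapOutputCopier threads are created, started and added to the copiers list. The run method of this thread contains an infinite loop that keeps watching the scheduledCopies list, which holds the map outputs scheduled for copying. As soon as scheduledCopies contains a MapOutputLocation, the thread takes the first one and calls copyOutput(loc) to copy the map output remotely over HTTP. copyOutput(loc) first checks whether this MapOutputLocation is in copiedMapOutputs or obsoleteMapIds; if so it must not be copied, and the method returns -2 immediately. Then getMapOutput(MapOutputLocation mapOutputLoc, Path filename, int reduce) connects to the remote TaskTracker and obtains an input stream. After a series of checks it determines whether the in-memory file system can hold this map output: if it can, shuffleInMemory puts the file in memory; otherwise shuffleToDisk writes it to disk. (shuffleInMemory waits until enough memory has been freed, closes the input stream and opens it again, allocates space in memory, copies the map data into that space, wraps it in a MapOutput and returns it. shuffleToDisk first finds a suitable local location for the map output, constructs a MapOutput object, streams the data from the input stream into the file behind the output stream, wraps that file in the MapOutput and returns it.) Back in copyOutput, the returned MapOutput goes through some more checks. In the end, if it is in memory, mapOutputsFilesInMemory.add(mapOutput) is called; otherwise the file on local disk is renamed and its FileStatus is added to mapOutputFilesOnDisk. The finish method in the finally block of run puts the copied MapOutputLocation into copyResults.
(2) A LocalFSMerger object is constructed and its thread started. As long as exitLocalFSMerge==false, its run method waits until the number of local files is >= (2 * ioSortFactor - 1), which triggers a merge of local files. ioSortFactor is the parameter "io.sort.factor", 10 by default. The thread then takes the ioSortFactor smallest files (10 by default) from mapOutputFilesOnDisk (a SortedSet) into mapFiles, merge-sorts them with Merger.merge, writes the result to the file designated by writer, and adds the new file to mapOutputFilesOnDisk. Even if a combiner is configured, it is not called here.
(3) An InMemFSMergeThread object is constructed and its thread started. Its run method loops, checking through exit = ramManager.waitForDataToMerge() whether the in-memory files can be merged. A merge of the in-memory files is triggered when any one of the following holds: 1. the data copy has finished and the ShuffleRamManager has been closed; 2. the memory used in the ShuffleRamManager exceeds the fraction "mapred.job.shuffle.merge.percent" (0.66 by default) of the available memory and the number of in-memory files exceeds 2; 3. the number of in-memory files exceeds "mapred.inmem.merge.threshold", 1000 by default; 4. the number of requests blocked on the ShuffleRamManager exceeds 0.75 of the number of copier threads "mapred.reduce.parallel.copies". When a condition is met, doInMemMerge() performs the merge. It uses the Merger utility class; if a combiner is configured, combinerRunner.combine aggregates the sorted data before it is written to the local file designated by writer. One thing to note: run uses a do-while loop whose condition is (!exit), so it keeps running only while exit==false, and waitForDataToMerge shows that true is returned only after ramManager has been closed.
(4) A GetMapEventsThread object is constructed and its thread started. Its run method calls getMapCompletionEvents() every 1s until exitGetMapEvents==true (set to true in fetchOutputs()). This method talks to the TaskTracker and calls TaskTracker.getMapCompletionEvents to obtain the list of completed Map Tasks. The rule is as follows. First, shouldReset is checked for the ID of the current reduce task; if it is there, the shuffle in progress has to be rolled back, and a MapTaskCompletionEventsUpdate marked for reset is returned. If it is not there, the FetchStatus of the job that the current reduce task belongs to is looked up in runningJobs; the newly completed map tasks are obtained with FetchStatus.getMapEvents(fromEventId, maxLocs), which takes the required completed maps from allMapEvents, and this list is wrapped in a MapTaskCompletionEventsUpdate and returned. So where does the data in allMapEvents come from? The TaskTracker has a MapEventsFetcherThread whose run method periodically goes through all jobs in runningJobs and, for each job, takes the FetchStatus of the job as soon as it finds a first reduce task in the SHUFFLE phase. For each such FetchStatus it calls fetchMapCompletionEvents(currentTime), which calls queryJobTracker(fromEventId, jobId, jobClient) to talk to the JobTracker; JobTracker.getTaskCompletionEvents returns the matching TaskCompletionEvent entries from taskCompletionEvents in JobInProgress, and the ones that belong to map tasks are used to update allMapEvents.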
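Once getMapCompletionEvents() has obtained the MapTaskCompletionEventsUpdate, it puts the list of completed maps into TaskCompletionEvent events[]. If the update is a reset, fromEventId, obsoleteMapIds and mapLocations are reset. fromEventId is then updated to the latest number of the completed maps already received, so that later requests only ask for events after that number. Next, every TaskCompletionEvent in events is examined according to its status: if it is SUCCEEDED, it is put into mapLocations (which maps a TaskTracker host to its list of completed tasks) so that the map output can be fetched; if it is OBSOLETE/FAILED/KILLED, it is put into obsoleteMapIds, meaning that fetching from these maps stops; if it is TIPFAILED, it is put into copiedMapOutputs, meaning that no data needs to be fetched from these maps. Finally the number of entries newly added to mapLocations is returned.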
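The routing of completion events by status can be sketched as below. This is a simplified model written for this article, not the Hadoop code: MapEventRouter, Event and Status are made-up types, and plain strings stand in for TaskAttemptID, TaskID and MapOutputLocation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of how completion events are sorted into the three collections.
class MapEventRouter {
    enum Status { SUCCEEDED, FAILED, KILLED, OBSOLETE, TIPFAILED }

    static final class Event {
        final String attemptId; // stand-in for TaskAttemptID
        final String taskId;    // stand-in for TaskID
        final String host;
        final Status status;

        Event(String attemptId, String taskId, String host, Status status) {
            this.attemptId = attemptId;
            this.taskId = taskId;
            this.host = host;
            this.status = status;
        }
    }

    final Map<String, List<String>> mapLocations = new HashMap<>(); // host -> fetchable attempts
    final Set<String> obsoleteMapIds = new HashSet<>();             // attempts to stop fetching
    final Set<String> copiedMapOutputs = new HashSet<>();           // tasks that need no fetch

    // Returns how many new fetchable outputs were added to mapLocations.
    int route(List<Event> events) {
        int added = 0;
        for (Event e : events) {
            switch (e.status) {
                case SUCCEEDED:
                    mapLocations.computeIfAbsent(e.host, h -> new ArrayList<>()).add(e.attemptId);
                    added++;
                    break;
                case FAILED:
                case KILLED:
                case OBSOLETE:
                    obsoleteMapIds.add(e.attemptId);
                    break;
                case TIPFAILED:
                    // The whole task has failed for good; counting it as "copied" lets the copy loop finish.
                    copiedMapOutputs.add(e.taskId);
                    break;
            }
        }
        return added;
    }

    public static void main(String[] args) {
        MapEventRouter router = new MapEventRouter();
        int added = router.route(Arrays.asList(
            new Event("attempt_m_0_0", "task_m_0", "hostA", Status.SUCCEEDED),
            new Event("attempt_m_1_0", "task_m_1", "hostB", Status.FAILED),
            new Event("attempt_m_2_0", "task_m_2", "hostA", Status.TIPFAILED)));
        System.out.println(added + " new location(s): " + router.mapLocations);
    }
}
```

Putting TIPFAILED tasks into copiedMapOutputs matters because the main loop in fetchOutputs() runs until copiedMapOutputs.size() reaches numMaps.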
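After these threads have been started in fetchOutputs(), they still cannot do any work: suitable MapOutputLocation entries from mapLocations must first be put into scheduledCopies, which wakes up the MapOutputCopier threads to copy. The copy results are then handled as follows: A. a successful copy is removed from fetchFailedMaps; B. an obsolete result is ignored; C. any other failure is added to retryFetches, and the failure count of the corresponding mapTaskId is increased by 1 and stored in mapTaskToFailedFetchesMap, the structure that maps a mapTaskId to its number of failures. Fault-tolerance mechanism 1: once the number of failed copies reaches the limit (Math.max(30, numMaps / 10)), the Reduce Task is killed (and waits for the scheduler to schedule it again). Fault-tolerance mechanism 2: once the number of failed copies reaches maxFetchFailuresBeforeReporting (set by the parameter "mapreduce.reduce.shuffle.maxfetchfailures", 10 by default), the map is added to fetchFailedMaps, and the reduce task is killed if all of the following hold: 1. the reducer is not healthy (in the code, the share of failed fetches among all fetch attempts has reached MAX_ALLOWED_FAILED_FETCH_ATTEMPT_PERCENT); 2. the size of fetchFailedMaps has reached its limit (5 by default) or equals the number of maps the reducer needs minus the size of copiedMapOutputs; 3. the reducer has not made enough progress, or it has been stalled for too long. Fault-tolerance mechanism 3: if neither of the first two mechanisms kills the task, the copy of that map's output is retried after a delay that grows exponentially with the number of failures (an exponential backoff, not a logarithmic model): the delay is 10000*Math.pow(1.3, noFailedFetches) milliseconds, and the host is put into penaltyBox for that long. Finally, once the copy is complete, some cleanup is done: ramManager is closed, which makes the InMemFSMergeThread thread finish and exit; exitGetMapEvents=true makes GetMapEventsThread finish and exit; exitLocalFSMerge=true makes the LocalFSMerger thread finish and exit; every MapOutputCopier thread in copiers is interrupted in turn, and the list is emptied with copiers.clear().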
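The penalty delay from fault-tolerance mechanism 3 is easy to reproduce. The sketch below uses the two numbers quoted above (10000 ms and a growth rate of 1.3); the class FetchPenalty and its method names are made up for this example, and the map is only a stand-in for penaltyBox:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the exponential backoff used for the penalty box.
class FetchPenalty {
    static final long INITIAL_PENALTY_MS = 10000L;  // value quoted in the text
    static final double PENALTY_GROWTH_RATE = 1.3;  // value quoted in the text

    private final Map<String, Long> penaltyBox = new HashMap<>(); // host -> time the penalty ends

    // Delay before a host is contacted again after its n-th failed fetch.
    static long backoffMillis(int failedFetches) {
        return (long) (INITIAL_PENALTY_MS * Math.pow(PENALTY_GROWTH_RATE, failedFetches));
    }

    void penalize(String host, int failedFetches, long now) {
        penaltyBox.put(host, now + backoffMillis(failedFetches));
    }

    // A host is skipped while its penalty is running; an expired entry is dropped.
    boolean isPenalized(String host, long now) {
        Long penaltyEnd = penaltyBox.get(host);
        if (penaltyEnd == null) {
            return false;
        }
        if (now < penaltyEnd) {
            return true;
        }
        penaltyBox.remove(host);
        return false;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            System.out.println("failure #" + n + " -> wait " + backoffMillis(n) / 1000 + " s");
        }
    }
}
```

Since 1.3 is raised to the number of failures, each further failure lengthens the wait by 30%, so a host that keeps failing is contacted less and less often while the other hosts continue to be served.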
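That wraps up the reduce task. A good deal of the whole MapReduce process has now been covered and the overall flow is known. Many things have still not been touched on, such as the recovery mechanism, the fault-tolerance mechanism, speculative execution of tasks, quicksort and merge sort, and the file flow including file names and locations. I will keep studying these later.
Reference: 1. Dong Xicheng (董西成), 《hadoop技术内幕---深入理解MapReduce架构设计与实现原理》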