A note first: this is groundwork for implementing PantaRay, or something similar to DreamWorks' out-of-core point-cloud GI, and it is the first step toward ray tracing large point clouds. In the real application the int type would be replaced by a 64-bit uint64_t that represents a hash key in space. All of the code is written with the STL and Boost at a reasonably high level of abstraction, so readers can adapt it to their own needs.
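The article does not say how the 64-bit spatial key is built. As one possible illustration only, here is a sketch of a Morton (Z-order) key that interleaves three 21-bit grid coordinates into a uint64_t. The names SpreadBits21 and MortonKey are hypothetical, and the choice of a Morton code is an assumption, not the author's stated method.

```cpp
#include <cassert>
#include <cstdint>

// Spread the low 21 bits of v so that two zero bits separate each input bit.
inline std::uint64_t SpreadBits21(std::uint64_t v)
{
    v &= 0x1fffffULL;
    v = (v | (v << 32)) & 0x1f00000000ffffULL;
    v = (v | (v << 16)) & 0x1f0000ff0000ffULL;
    v = (v | (v << 8))  & 0x100f00f00f00f00fULL;
    v = (v | (v << 4))  & 0x10c30c30c30c30c3ULL;
    v = (v | (v << 2))  & 0x1249249249249249ULL;
    return v;
}

// Hypothetical spatial key: interleave x, y, z (each below 2^21) into 63 bits.
// Bit i of x lands at bit 3i, bit i of y at 3i + 1, bit i of z at 3i + 2.
inline std::uint64_t MortonKey(std::uint32_t x, std::uint32_t y, std::uint32_t z)
{
    assert(x < (1u << 21) && y < (1u << 21) && z < (1u << 21));
    return SpreadBits21(x) | (SpreadBits21(y) << 1) | (SpreadBits21(z) << 2);
}
```

Keys of this kind sort in the same way as plain integers, so the sorting and merging code in this article would only need int replaced by the key type.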
First, the test data. Its size is limited: in my x86_64 environment a single array allocated with malloc/new was capped at about 1 GB, so one in-memory sort (qsort or std::stable_sort) cannot handle more than that, and the in-memory reference result is limited accordingly. For this test I therefore generated a single file of about 962M containing 246324610 ints.
The following program generates the test data: uniformly distributed integers drawn from a Mersenne Twister (MT19937) generator.
```cpp
#include <cstdio>
#include <cstdlib>
#include <iostream>

#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_int_distribution.hpp>

// Usage: <program> <output path> <number of ints>
int main(int argc, char * argv[])
{
    -- argc, ++ argv;
    if (argc != 2)
    {
        return 1;
    }

    char * szPath = argv[0];
    int iCount = atoi(argv[1]);
    std::cout << szPath << " " << iCount << std::endl;

    // Default-seeded MT19937, uniform integers in [0, 99999999].
    boost::random::mt19937 cGen;
    boost::random::uniform_int_distribution<> cDist(0, 99999999);

    FILE * pFile = fopen(szPath, "wb");
    if (! pFile)
    {
        return 1;
    }

    // Write the raw ints one after another.
    for (int i = 0; i < iCount; ++ i)
    {
        int iRandom = cDist(cGen);
        fwrite(& iRandom, sizeof(int), 1, pFile);
    }
    fclose(pFile);

    return 0;
}
```
Next, sort the same data entirely in memory and store the result in a separate file, to be used as the reference for comparison.
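```cpp
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <functional>

// Usage: <program> <unsorted input> <sorted output>
int main(int argc, char * argv[])
{
    -- argc, ++ argv;
    if (argc != 2)
    {
        return EXIT_FAILURE;
    }

    FILE * pOriginalFile = fopen(argv[0], "rb");
    if (! pOriginalFile)
    {
        return EXIT_FAILURE;
    }

    // Derive the number of items from the file size.
    fseek(pOriginalFile, 0, SEEK_END);
    long lSize = ftell(pOriginalFile);
    fseek(pOriginalFile, 0, SEEK_SET);
    int iNumItems = static_cast<int>(lSize / sizeof(int));

    // Load everything into one array and sort it in memory.
    int * pData = new int[iNumItems];
    size_t uNumRead = fread(pData, sizeof(int), iNumItems, pOriginalFile);
    fclose(pOriginalFile);
    if (uNumRead != static_cast<size_t>(iNumItems))
    {
        delete [] pData;
        return EXIT_FAILURE;
    }

    std::stable_sort(pData, pData + iNumItems, std::less<int>());

    FILE * pSortedFile = fopen(argv[1], "wb");
    if (! pSortedFile)
    {
        delete [] pData;
        return EXIT_FAILURE;
    }
    fwrite(pData, sizeof(int), iNumItems, pSortedFile);
    fclose(pSortedFile);

    delete [] pData;

    return EXIT_SUCCESS;
}
```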
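The article does not show how the two results are compared. A minimal sketch of such a check, assuming files of raw ints as written above, could look like this. The helper names PackInts, IsSortedIntStream and SameIntStreams are hypothetical and not part of the original code.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying the helpers in memory
#include <string>
#include <vector>

// Hypothetical helper: pack ints into raw bytes, the way the article's files store them.
inline std::string PackInts(const std::vector<int> & vValues)
{
    if (vValues.empty())
    {
        return std::string();
    }
    return std::string(reinterpret_cast<const char *>(& vValues[0]), vValues.size() * sizeof(int));
}

// Returns true if the stream holds raw ints in non-decreasing order.
inline bool IsSortedIntStream(std::istream & cInput)
{
    int iPrevious = 0;
    int iCurrent = 0;
    bool bFirst = true;
    while (cInput.read(reinterpret_cast<char *>(& iCurrent), sizeof(int)))
    {
        if (! bFirst && iCurrent < iPrevious)
        {
            return false;
        }
        iPrevious = iCurrent;
        bFirst = false;
    }
    return true;
}

// Returns true if both streams contain exactly the same sequence of ints.
inline bool SameIntStreams(std::istream & cLeft, std::istream & cRight)
{
    int iLeft = 0;
    int iRight = 0;
    while (true)
    {
        const bool bLeft = ! cLeft.read(reinterpret_cast<char *>(& iLeft), sizeof(int)).fail();
        const bool bRight = ! cRight.read(reinterpret_cast<char *>(& iRight), sizeof(int)).fail();
        if (bLeft != bRight)
        {
            return false;   // one stream is shorter than the other
        }
        if (! bLeft)
        {
            return true;    // both ended at the same position
        }
        if (iLeft != iRight)
        {
            return false;
        }
    }
}
```

For real files, pass two std::ifstream objects opened with std::ios_base::binary: one for the in-memory reference result and one for the out-of-core result.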
On the design: disk access is effectively serialized at the low-level I/O layer, with only one thread reading or writing at a time, so all reading is done in the main thread. To keep the share of total time spent on disk writes as small as possible, each worker thread handles as much data as it can per job.
Let's define a class called Job. As the name suggests, it represents one unit of work; each Job has its own index and a block of unsorted int data.
```cpp
class Job
{
public:

    Job()
        :
        m_iIndex(0),
        m_iNumItems(0)
    {
    }

    Job(int iIndex, int iNumItems, const boost::shared_array<int> & aData)
        :
        m_iIndex(iIndex),
        m_iNumItems(iNumItems),
        m_aData(aData)
    {
    }

    Job(const Job & cCopy)
        :
        m_iIndex(cCopy.m_iIndex),
        m_iNumItems(cCopy.m_iNumItems),
        m_aData(cCopy.m_aData)
    {
    }

public:

    int m_iIndex;                       // batch index, also used for the temp file name
    int m_iNumItems;                    // number of ints in m_aData
    boost::shared_array<int> m_aData;   // unsorted data, shared rather than copied
};
```
Next comes Context, which holds the data shared between threads: the job queue, plus the mutex and the other objects needed for synchronization.
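```cpp
class Context
{
public:

    Context(int iNumSortingThread)
        :
        m_iNumSortingThread(iNumSortingThread),
        m_bHasMoreData(true)
    {
    }

public:

    int m_iNumSortingThread;

    bool m_bHasMoreData;                 // false once the main thread has queued everything

    boost::mutex m_cMutex;               // protects the queue and the flag
    boost::condition_variable m_cEvent;  // workers signal "give me more data"

    std::list<Job > m_lJobQueue;
};
```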
Here is the worker thread and the work it does. Access to the queue in Context must be locked. The worker takes one job out of the queue as local data, sorts it, and writes it out as a temporary file. At the end of each round it greedily tells the main thread that it wants more data; once there really is no more data, it exits.
```cpp
class SortingThread : public boost::thread
{
public:

    SortingThread(const boost::shared_ptr<Context> & pContext)
        :
        m_pContext(pContext)
    {
        // Base classes are constructed before members, so launching the thread
        // from the base-class initializer would let Sort() run before
        // m_pContext is set. Start it here and move it into this object.
        boost::thread cThread(boost::bind(& SortingThread::Sort, this));
        swap(cThread);
    }

    void Sort()
    {
        while (true)
        {
            Job cJob;
            {
                // The queue and the flag are shared, so read both under the lock.
                boost::unique_lock<boost::mutex> cLock(m_pContext->m_cMutex);
                if (m_pContext->m_lJobQueue.empty())
                {
                    if (! m_pContext->m_bHasMoreData)
                    {
                        break;
                    }
                }
                else
                {
                    // Get a job.
                    //
                    cJob = m_pContext->m_lJobQueue.front();
                    m_pContext->m_lJobQueue.pop_front();
                }
            }

            if (cJob.m_iNumItems)
            {
                std::stable_sort(cJob.m_aData.get(), cJob.m_aData.get() + cJob.m_iNumItems, std::less<int>());

                // Write out the sorted data.
                //
                char aBuffer[256];
                sprintf(aBuffer, "%.06d.tmp", cJob.m_iIndex);
                std::ofstream cOutput(aBuffer, std::ios_base::binary);
                cOutput.write(reinterpret_cast<const char *>(cJob.m_aData.get()), cJob.m_iNumItems * sizeof(int));
            }
            else
            {
                // Nothing queued yet: give up the time slice instead of spinning.
                boost::this_thread::yield();
            }

            // Tell the main thread we need more data here.
            //
            m_pContext->m_cEvent.notify_one();
        }
    }

private:

    boost::shared_ptr<Context> m_pContext;
};
```
All worker threads go into a simple thread pool, so that they can be started and joined together.
```cpp
class SortingThreadGroup : public boost::thread_group
{
public:

    SortingThreadGroup(const boost::shared_ptr<Context> & pContext)
        :
        m_pContext(pContext)
    {
        // The group takes ownership of every thread added to it.
        for (int i = 0; i < m_pContext->m_iNumSortingThread; ++ i)
        {
            SortingThread * pSortingThread = new SortingThread(pContext);
            add_thread(pSortingThread);
        }
    }

private:

    boost::shared_ptr<Context> m_pContext;
};
```
The main thread reads data from the input file and fills Job objects. It keeps the amount of data sitting in the queue within a fixed bound, so memory usage stays small; otherwise external sorting would lose its point.
```cpp
bool Sort(const char * szPath, int iNumSortingThreads, int iNumLocalItems)
{
    try
    {
        if (iNumSortingThreads <= 0 || iNumLocalItems <= 0)
        {
            std::cerr << "Invalid thread count or batch size." << std::endl;
            return false;
        }

        std::ifstream cUnSortedFile(szPath, std::ios_base::binary);
        if (! cUnSortedFile)
        {
            std::cerr << "Can't open " << szPath << " to read. " << std::endl;
            return false;
        }

        // Calculate real size and split the items into batches.
        //
        boost::uintmax_t ullSize = boost::filesystem::file_size(szPath);
        boost::uintmax_t ullNumItems = ullSize / sizeof(int);

        int iNumBatches = static_cast<int>(ullNumItems / iNumLocalItems);
        std::vector<int> vNumItemsPerBatch(iNumBatches, iNumLocalItems);
        int iNumRestItems = static_cast<int>(ullNumItems % iNumLocalItems);
        if (iNumRestItems)
        {
            vNumItemsPerBatch.push_back(iNumRestItems);
        }
        std::cout << "Number of Items : " << ullNumItems << std::endl
                  << "Number of Batches : " << vNumItemsPerBatch.size() << std::endl;

        boost::shared_ptr<Context> pContext(new Context(iNumSortingThreads));
        boost::scoped_ptr<SortingThreadGroup> pSortingThreadGroup(new SortingThreadGroup(pContext));

        boost::timer::auto_cpu_timer cTimer;
        for (std::size_t i = 0; i < vNumItemsPerBatch.size(); ++ i)
        {
            boost::shared_array<int> aData(new int[vNumItemsPerBatch[i]]);
            cUnSortedFile.read(reinterpret_cast<char *>(aData.get()), vNumItemsPerBatch[i] * sizeof(int));

            Job cJob(static_cast<int>(i), vNumItemsPerBatch[i], aData);

            // Block while the queue is full. The loop guards against spurious wake-ups.
            //
            boost::unique_lock<boost::mutex> cLock(pContext->m_cMutex);
            while (pContext->m_lJobQueue.size() > static_cast<std::size_t>(iNumSortingThreads) * 2)
            {
                pContext->m_cEvent.wait(cLock);
            }
            pContext->m_lJobQueue.push_back(cJob);
        }
        std::cout << std::endl;

        // The flag is shared with the workers, so change it under the lock.
        {
            boost::unique_lock<boost::mutex> cLock(pContext->m_cMutex);
            pContext->m_bHasMoreData = false;
        }

        pSortingThreadGroup->join_all();

        return true;
    }
    catch(const std::exception & cE)
    {
        std::cerr << cE.what() << std::endl;
    }
    catch(...)
    {
        std::cerr << __LINE__ << std::endl;
    }

    return false;
}
```
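The bounded hand-off between the reading thread and the sorting threads can also be packaged as a small blocking queue. The sketch below is not the article's code: it swaps in the C++11 standard library instead of Boost and uses two condition variables, so that idle workers sleep instead of polling. The class name BoundedQueue and its members are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

template <typename T>
class BoundedQueue
{
public:

    explicit BoundedQueue(std::size_t uCapacity)
        : m_uCapacity(uCapacity), m_bClosed(false)
    {
        assert(uCapacity > 0);
    }

    // Producer side: blocks while the queue is full, which bounds memory usage.
    void Push(const T & cItem)
    {
        std::unique_lock<std::mutex> cLock(m_cMutex);
        m_cNotFull.wait(cLock, [this] { return m_lItems.size() < m_uCapacity; });
        m_lItems.push_back(cItem);
        m_cNotEmpty.notify_one();
    }

    // Producer side: announce that no more items will arrive.
    void Close()
    {
        std::lock_guard<std::mutex> cLock(m_cMutex);
        m_bClosed = true;
        m_cNotEmpty.notify_all();
    }

    // Consumer side: blocks while the queue is empty and still open.
    // Returns false only when the queue is closed and fully drained.
    bool Pop(T & cItem)
    {
        std::unique_lock<std::mutex> cLock(m_cMutex);
        m_cNotEmpty.wait(cLock, [this] { return m_bClosed || ! m_lItems.empty(); });
        if (m_lItems.empty())
        {
            return false;
        }
        cItem = m_lItems.front();
        m_lItems.pop_front();
        m_cNotFull.notify_one();
        return true;
    }

private:

    std::size_t m_uCapacity;
    bool m_bClosed;
    std::mutex m_cMutex;
    std::condition_variable m_cNotFull;
    std::condition_variable m_cNotEmpty;
    std::deque<T> m_lItems;
};
```

In this arrangement the main thread would call Push for every batch it reads and Close after the last one, while each worker loops on Pop until it returns false. The capacity plays the role of the "twice the number of sorting threads" bound used in the article.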
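The second pass is the classical k-way merge. The idea is simple: open all the temporary files and keep a queue holding the current head element of each one. At every step, output the smallest element in the queue, then read the next number from the file it came from and put it into the queue. When a file is exhausted, that stream is dropped and the queue shrinks accordingly. This pass is single-threaded.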
```cpp
bool Merge(const char * szPath, int iNumBatches)
{
    try
    {
        //TODO : There is a limit on the number of files a process can keep open.
        //
        std::vector<boost::shared_ptr<std::ifstream> > vTempFiles;
        for (int i = 0; i < iNumBatches; ++ i)
        {
            char aBuffer[256];
            sprintf(aBuffer, "%.06d.tmp", i);
            boost::shared_ptr<std::ifstream> pTempFile(new std::ifstream(aBuffer, std::ios_base::binary));
            if (! pTempFile->is_open())
            {
                std::cerr << "Can't open " << aBuffer << " to read. " << std::endl;
                return false;
            }
            vTempFiles.push_back(pTempFile);
        }

        std::ofstream cSortedFile(szPath, std::ios_base::binary);
        if (! cSortedFile)
        {
            std::cerr << "Can't open " << szPath << " to write. " << std::endl;
            return false;
        }

        //
        boost::timer::auto_cpu_timer cTimer;

        // Output buffer, flushed to disk whenever it fills up.
        std::vector<int> vCache;
        vCache.reserve(10 * 1024 * 1024);

        // vQueue[i] holds the current head of vTempFiles[i]; both stay aligned.
        std::vector<int> vQueue;
        std::vector<boost::shared_ptr<std::ifstream> >::iterator iFile = vTempFiles.begin();
        while (iFile != vTempFiles.end())
        {
            int iNumber = - 1;
            if ((* iFile)->read(reinterpret_cast<char *>(& iNumber), sizeof(int)))
            {
                vQueue.push_back(iNumber);
                ++ iFile;
            }
            else
            {
                // Empty run: drop the stream right away.
                iFile = vTempFiles.erase(iFile);
            }
        }

        while (! vQueue.empty())
        {
            // Pick the smallest head element (linear scan over the open streams).
            std::vector<int>::iterator iMinPos = std::min_element(vQueue.begin(), vQueue.end());
            vCache.push_back(* iMinPos);
            if (vCache.size() == vCache.capacity())
            {
                cSortedFile.write(reinterpret_cast<const char *>(& vCache[0]), vCache.size() * sizeof(int));
                vCache.clear();
            }

            // Refill from the stream that supplied the minimum, or retire it.
            iFile = vTempFiles.begin() + (iMinPos - vQueue.begin());
            int iNumber = - 1;
            if ((* iFile)->read(reinterpret_cast<char *>(& iNumber), sizeof(int)))
            {
                (* iMinPos) = iNumber;
            }
            else
            {
                vTempFiles.erase(iFile);
                vQueue.erase(iMinPos);
            }
        }

        // Flush whatever is left in the output buffer.
        if (! vCache.empty())
        {
            cSortedFile.write(reinterpret_cast<const char *>(& vCache[0]), vCache.size() * sizeof(int));
        }

        return true;
    }
    catch(const std::exception & cE)
    {
        std::cerr << cE.what() << std::endl;
    }
    catch(...)
    {
        std::cerr << __LINE__ << std::endl;
    }

    return false;
}
```
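The merge above finds the minimum with a linear scan over all open streams. A common alternative, shown here only as a sketch and not as the article's method, keeps the head elements in a min-heap so each step costs O(log k) instead of O(k). To stay self-contained it merges in-memory vectors rather than files; the function name MergeRuns is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Merge already-sorted runs into one sorted vector using a min-heap.
inline std::vector<int> MergeRuns(const std::vector<std::vector<int> > & vRuns)
{
    // Each heap entry is (value, index of the run it came from).
    typedef std::pair<int, std::size_t> Entry;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > cHeap;

    // vNext[i] is the next unread position in run i.
    std::vector<std::size_t> vNext(vRuns.size(), 0);
    for (std::size_t i = 0; i < vRuns.size(); ++ i)
    {
        if (! vRuns[i].empty())
        {
            cHeap.push(Entry(vRuns[i][0], i));
            vNext[i] = 1;
        }
    }

    std::vector<int> vMerged;
    while (! cHeap.empty())
    {
        const Entry cTop = cHeap.top();
        cHeap.pop();

        // If every run was sorted, the output can never go backwards.
        assert(vMerged.empty() || vMerged.back() <= cTop.first);
        vMerged.push_back(cTop.first);

        // Replace the consumed element with the next one from the same run.
        const std::size_t uRun = cTop.second;
        if (vNext[uRun] < vRuns[uRun].size())
        {
            cHeap.push(Entry(vRuns[uRun][vNext[uRun]], uRun));
            ++ vNext[uRun];
        }
    }
    return vMerged;
}
```

With only a dozen or so temporary files the linear scan is perfectly adequate; the heap starts to matter when the number of runs grows into the hundreds or thousands.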
The test machine is a single Xeon E5-2603 at 1.8 GHz with 4 hardware threads. Each Job carries 80M of data, which means each worker thread sorts 20M ints at a time. The disk is a Western Digital Blue: a purely mechanical drive, neither an SSD nor a hybrid.
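To make the batch sizes concrete, the following sketch repeats the splitting arithmetic used by Sort(). It assumes that "20M" means 20 × 1024 × 1024 ints, which is how main() in the full listing scales its batch-size argument; the function name SplitIntoBatches is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Split ullNumItems into full batches of ullItemsPerBatch plus one remainder batch.
inline std::vector<std::uint64_t> SplitIntoBatches(std::uint64_t ullNumItems,
                                                   std::uint64_t ullItemsPerBatch)
{
    assert(ullItemsPerBatch > 0);
    const std::size_t uNumFull = static_cast<std::size_t>(ullNumItems / ullItemsPerBatch);
    std::vector<std::uint64_t> vBatches(uNumFull, ullItemsPerBatch);
    const std::uint64_t ullRest = ullNumItems % ullItemsPerBatch;
    if (ullRest)
    {
        vBatches.push_back(ullRest);
    }
    return vBatches;
}
```

Under that assumption, the 246324610 ints of the test file split into 11 full batches of 20971520 ints and one final batch of 15637890 ints, so the merge pass would have 12 temporary files to combine.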
The first pass (Sort) took 19.111468s wall, 52.369536s user + 4.243227s system = 56.612763s CPU (296.2%). That is a CPU efficiency of 296.2% / 300% = 98.7%, with almost all of the time spent inside std::stable_sort.
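The second pass (Merge) took 33.082600s wall, 29.874191s user + 3.010819s system = 32.885011s CPU (99.4%), most of it spent on disk writes and sorting. Giving each file stream its own cache would improve performance noticeably, but there is a catch: any cache raises memory usage, and if it grows too large the merge loses its point.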
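A per-stream cache of the kind mentioned here could be sketched as follows. This is an illustration under stated assumptions, not part of the original code: the class name BufferedIntReader is hypothetical, and the buffer size is left as a parameter precisely because of the memory trade-off described above (total cache memory is the buffer size times the number of open runs).

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying the reader in memory
#include <string>
#include <vector>

// Reads raw ints from a stream in blocks of uBufferItems instead of one at a time.
class BufferedIntReader
{
public:

    BufferedIntReader(std::istream & cInput, std::size_t uBufferItems)
        : m_cInput(cInput), m_vBuffer(uBufferItems), m_uPos(0), m_uCount(0)
    {
        assert(uBufferItems > 0);
    }

    // Fetch the next int; returns false once the stream is exhausted.
    bool Next(int & iValue)
    {
        if (m_uPos == m_uCount && ! Refill())
        {
            return false;
        }
        iValue = m_vBuffer[m_uPos ++];
        return true;
    }

private:

    bool Refill()
    {
        m_cInput.read(reinterpret_cast<char *>(& m_vBuffer[0]), m_vBuffer.size() * sizeof(int));
        // gcount() reports the bytes actually read, also for a short final block.
        m_uCount = static_cast<std::size_t>(m_cInput.gcount()) / sizeof(int);
        m_uPos = 0;
        return m_uCount > 0;
    }

    std::istream & m_cInput;
    std::vector<int> m_vBuffer;
    std::size_t m_uPos;
    std::size_t m_uCount;
};
```

In the merge loop, each temporary file would be wrapped in one reader and the single-int read calls replaced by Next. With 12 runs and, say, 1M ints per buffer, the caches would add about 48 MB on top of the output buffer.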
Readers may wonder about the repeated new in the main thread. Since Vista, memory allocation on Windows has been pool-based, and in any case allocation is not the bottleneck here; disk I/O is. So no optimization is needed at this point. There is not much to gain architecturally either, because this is not the traditional Multiple Reader + Single Writer setup but Multiple Reader and Writer + Single Writer, which differs somewhat from the usual producer/consumer way of organizing threads. In the future I will try a lock-free design instead of a mutex, but that is a topic for later.
Here is the full code.
```cpp
/**
 * Multithreading C++ Out of Core Sorting for Massive Data
 *
 * Copyright (c) 2013 Bo Zhou<Bo.Schwarzstein@gmail.com>
 * All rights reserved.
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *
 *     * Redistributions of source code must retain the above copyright
 *       notice, this list of conditions and the following disclaimer.
 *     * Redistributions in binary form must reproduce the above copyright
 *       notice, this list of conditions and the following disclaimer in the
 *       documentation and/or other materials provided with the distribution.
 *     * Neither the name of the University of California, Berkeley nor the
 *       names of its contributors may be used to endorse or promote products
 *       derived from this software without specific prior written permission.
 */

#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <functional>
#include <iostream>
#include <list>
#include <queue>
#include <vector>

#include <boost/bind.hpp>
#include <boost/filesystem.hpp>
#include <boost/smart_ptr.hpp>
#include <boost/thread.hpp>
#include <boost/timer/timer.hpp>

class Job
{
public:

    Job()
        :
        m_iIndex(0),
        m_iNumItems(0)
    {
    }

    Job(int iIndex, int iNumItems, const boost::shared_array<int> & aData)
        :
        m_iIndex(iIndex),
        m_iNumItems(iNumItems),
        m_aData(aData)
    {
    }

    Job(const Job & cCopy)
        :
        m_iIndex(cCopy.m_iIndex),
        m_iNumItems(cCopy.m_iNumItems),
        m_aData(cCopy.m_aData)
    {
    }

public:

    int m_iIndex;
    int m_iNumItems;
    boost::shared_array<int> m_aData;
};

class Context
{
public:

    Context(int iNumSortingThread)
        :
        m_iNumSortingThread(iNumSortingThread),
        m_bHasMoreData(true)
    {
    }

public:

    int m_iNumSortingThread;

    bool m_bHasMoreData;

    boost::mutex m_cMutex;
    boost::condition_variable m_cEvent;

    std::list<Job > m_lJobQueue;
};

class SortingThread : public boost::thread
{
public:

    SortingThread(const boost::shared_ptr<Context> & pContext)
        :
        m_pContext(pContext)
    {
        // Base classes are constructed before members, so launching the thread
        // from the base-class initializer would let Sort() run before
        // m_pContext is set. Start it here and move it into this object.
        boost::thread cThread(boost::bind(& SortingThread::Sort, this));
        swap(cThread);
    }

    void Sort()
    {
        while (true)
        {
            Job cJob;
            {
                // The queue and the flag are shared, so read both under the lock.
                boost::unique_lock<boost::mutex> cLock(m_pContext->m_cMutex);
                if (m_pContext->m_lJobQueue.empty())
                {
                    if (! m_pContext->m_bHasMoreData)
                    {
                        break;
                    }
                }
                else
                {
                    // Get a job.
                    //
                    cJob = m_pContext->m_lJobQueue.front();
                    m_pContext->m_lJobQueue.pop_front();
                }
            }

            if (cJob.m_iNumItems)
            {
                std::stable_sort(cJob.m_aData.get(), cJob.m_aData.get() + cJob.m_iNumItems, std::less<int>());

                // Write out the sorted data.
                //
                char aBuffer[256];
                sprintf(aBuffer, "%.06d.tmp", cJob.m_iIndex);
                std::ofstream cOutput(aBuffer, std::ios_base::binary);
                cOutput.write(reinterpret_cast<const char *>(cJob.m_aData.get()), cJob.m_iNumItems * sizeof(int));
            }
            else
            {
                // Nothing queued yet: give up the time slice instead of spinning.
                boost::this_thread::yield();
            }

            // Tell the main thread we need more data here.
            //
            m_pContext->m_cEvent.notify_one();
        }
    }

private:

    boost::shared_ptr<Context> m_pContext;
};

class SortingThreadGroup : public boost::thread_group
{
public:

    SortingThreadGroup(const boost::shared_ptr<Context> & pContext)
        :
        m_pContext(pContext)
    {
        for (int i = 0; i < m_pContext->m_iNumSortingThread; ++ i)
        {
            SortingThread * pSortingThread = new SortingThread(pContext);
            add_thread(pSortingThread);
        }
    }

private:

    boost::shared_ptr<Context> m_pContext;
};

///////////////////////////////////////////////////////////////////////////////////////////////////

bool Sort(const char * szPath, int iNumSortingThreads, int iNumLocalItems)
{
    try
    {
        if (iNumSortingThreads <= 0 || iNumLocalItems <= 0)
        {
            std::cerr << "Invalid thread count or batch size." << std::endl;
            return false;
        }

        std::ifstream cUnSortedFile(szPath, std::ios_base::binary);
        if (! cUnSortedFile)
        {
            std::cerr << "Can't open " << szPath << " to read. " << std::endl;
            return false;
        }

        // Calculate real size and split the items into batches.
        //
        boost::uintmax_t ullSize = boost::filesystem::file_size(szPath);
        boost::uintmax_t ullNumItems = ullSize / sizeof(int);

        int iNumBatches = static_cast<int>(ullNumItems / iNumLocalItems);
        std::vector<int> vNumItemsPerBatch(iNumBatches, iNumLocalItems);
        int iNumRestItems = static_cast<int>(ullNumItems % iNumLocalItems);
        if (iNumRestItems)
        {
            vNumItemsPerBatch.push_back(iNumRestItems);
        }
        std::cout << "Number of Items : " << ullNumItems << std::endl
                  << "Number of Batches : " << vNumItemsPerBatch.size() << std::endl;

        boost::shared_ptr<Context> pContext(new Context(iNumSortingThreads));
        boost::scoped_ptr<SortingThreadGroup> pSortingThreadGroup(new SortingThreadGroup(pContext));

        boost::timer::auto_cpu_timer cTimer;
        for (std::size_t i = 0; i < vNumItemsPerBatch.size(); ++ i)
        {
            boost::shared_array<int> aData(new int[vNumItemsPerBatch[i]]);
            cUnSortedFile.read(reinterpret_cast<char *>(aData.get()), vNumItemsPerBatch[i] * sizeof(int));

            Job cJob(static_cast<int>(i), vNumItemsPerBatch[i], aData);

            // Block while the queue is full. The loop guards against spurious wake-ups.
            //
            boost::unique_lock<boost::mutex> cLock(pContext->m_cMutex);
            while (pContext->m_lJobQueue.size() > static_cast<std::size_t>(iNumSortingThreads) * 2)
            {
                pContext->m_cEvent.wait(cLock);
            }
            pContext->m_lJobQueue.push_back(cJob);
        }
        std::cout << std::endl;

        // The flag is shared with the workers, so change it under the lock.
        {
            boost::unique_lock<boost::mutex> cLock(pContext->m_cMutex);
            pContext->m_bHasMoreData = false;
        }

        pSortingThreadGroup->join_all();

        return true;
    }
    catch(const std::exception & cE)
    {
        std::cerr << cE.what() << std::endl;
    }
    catch(...)
    {
        std::cerr << __LINE__ << std::endl;
    }

    return false;
}

///////////////////////////////////////////////////////////////////////////////////////////////////

bool Merge(const char * szPath, int iNumBatches)
{
    try
    {
        //TODO : There is a limit on the number of files a process can keep open.
        //
        std::vector<boost::shared_ptr<std::ifstream> > vTempFiles;
        for (int i = 0; i < iNumBatches; ++ i)
        {
            char aBuffer[256];
            sprintf(aBuffer, "%.06d.tmp", i);
            boost::shared_ptr<std::ifstream> pTempFile(new std::ifstream(aBuffer, std::ios_base::binary));
            if (! pTempFile->is_open())
            {
                std::cerr << "Can't open " << aBuffer << " to read. " << std::endl;
                return false;
            }
            vTempFiles.push_back(pTempFile);
        }

        std::ofstream cSortedFile(szPath, std::ios_base::binary);
        if (! cSortedFile)
        {
            std::cerr << "Can't open " << szPath << " to write. " << std::endl;
            return false;
        }

        //
        boost::timer::auto_cpu_timer cTimer;

        // Output buffer, flushed to disk whenever it fills up.
        std::vector<int> vCache;
        vCache.reserve(10 * 1024 * 1024);

        // vQueue[i] holds the current head of vTempFiles[i]; both stay aligned.
        std::vector<int> vQueue;
        std::vector<boost::shared_ptr<std::ifstream> >::iterator iFile = vTempFiles.begin();
        while (iFile != vTempFiles.end())
        {
            int iNumber = - 1;
            if ((* iFile)->read(reinterpret_cast<char *>(& iNumber), sizeof(int)))
            {
                vQueue.push_back(iNumber);
                ++ iFile;
            }
            else
            {
                // Empty run: drop the stream right away.
                iFile = vTempFiles.erase(iFile);
            }
        }

        while (! vQueue.empty())
        {
            // Pick the smallest head element (linear scan over the open streams).
            std::vector<int>::iterator iMinPos = std::min_element(vQueue.begin(), vQueue.end());
            vCache.push_back(* iMinPos);
            if (vCache.size() == vCache.capacity())
            {
                cSortedFile.write(reinterpret_cast<const char *>(& vCache[0]), vCache.size() * sizeof(int));
                vCache.clear();
            }

            // Refill from the stream that supplied the minimum, or retire it.
            iFile = vTempFiles.begin() + (iMinPos - vQueue.begin());
            int iNumber = - 1;
            if ((* iFile)->read(reinterpret_cast<char *>(& iNumber), sizeof(int)))
            {
                (* iMinPos) = iNumber;
            }
            else
            {
                vTempFiles.erase(iFile);
                vQueue.erase(iMinPos);
            }
        }

        // Flush whatever is left in the output buffer.
        if (! vCache.empty())
        {
            cSortedFile.write(reinterpret_cast<const char *>(& vCache[0]), vCache.size() * sizeof(int));
        }

        return true;
    }
    catch(const std::exception & cE)
    {
        std::cerr << cE.what() << std::endl;
    }
    catch(...)
    {
        std::cerr << __LINE__ << std::endl;
    }

    return false;
}

// Usage:
//   <program> <unsorted file> <number of sorting threads> <batch size in M ints>   (sort pass)
//   <program> <sorted output file> <number of batches>                             (merge pass)
int main(int argc, char * argv[])
{
    int iRet = EXIT_FAILURE;

    //
    char * szPath = NULL;

    int iNumSortingThreads = 0;
    int iNumLocalItems = 0;

    int iNumBatches = 0;

    //
    -- argc, ++ argv;
    if (argc == 3)
    {
        szPath = argv[0];
        iNumSortingThreads = atoi(argv[1]);
        iNumLocalItems = atoi(argv[2]) * 1024 * 1024;
        if (Sort(szPath, iNumSortingThreads, iNumLocalItems))
        {
            iRet = EXIT_SUCCESS;
        }
    }
    else if (argc == 2)
    {
        szPath = argv[0];
        iNumBatches = atoi(argv[1]);
        if (Merge(szPath, iNumBatches))
        {
            iRet = EXIT_SUCCESS;
        }
    }

    return iRet;
}
```