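In last week's testing, the recording program crashed again.
Enough talk; time to bring out gdb:
(1) First, look at the call stack to find out where the crash happened: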
(gdb) bt
#0 0x080b33ef in COutputTS::run (this=0x9800568) at http://www.cnblogs.com/src/COutputTS.cpp:130
#1 0x080c3192 in CThread::run (this=0x98005b0) at http://www.cnblogs.com/src/CThread.cpp:82
#2 0x080c3149 in CThread::run1 (this=0x98005b0) at http://www.cnblogs.com/src/CThread.cpp:64
#3 0x080c3117 in CThread::run0 (pVoid=0x98005b0) at http://www.cnblogs.com/src/CThread.cpp:48
#4 0x005daab5 in start_thread () from /lib/libpthread.so.0
#5 0x0027b83e in clone () from /lib/libc.so.6
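This shows that the program crashed at line 130 of "COutputTS.cpp".
With so many threads running, which thread's call caused it?
(2) Before rushing into the code, check the thread states: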
(gdb) info threads
  27 Thread 0x9a6b70 (LWP 8051) 0x00a1c424 in __kernel_vsyscall ()
  26 Thread 0x923b70 (LWP 7898) 0x00a1c424 in __kernel_vsyscall ()
  25 Thread 0x964b70 (LWP 7899) 0x00a1c424 in __kernel_vsyscall ()
  24 Thread 0xb39f7b70 (LWP 7894) 0x00a1c424 in __kernel_vsyscall ()
  23 Thread 0x69bb70 (LWP 7895) 0x00a1c424 in __kernel_vsyscall ()
  22 Thread 0x8e2b70 (LWP 7896) 0x00a1c424 in __kernel_vsyscall ()
  21 Thread 0x553b70 (LWP 7889) 0x00a1c424 in __kernel_vsyscall ()
  20 Thread 0xb49fab70 (LWP 7887) 0x00a1c424 in __kernel_vsyscall ()
  19 Thread 0x532b70 (LWP 7888) 0x00a1c424 in __kernel_vsyscall ()
  18 Thread 0x594b70 (LWP 7890) 0x00a1c424 in __kernel_vsyscall ()
  17 Thread 0x4f1b70 (LWP 7886) 0x00a1c424 in __kernel_vsyscall ()
  16 Thread 0xb53fbb70 (LWP 7885) 0x00a1c424 in __kernel_vsyscall ()
  15 Thread 0xb5dfcb70 (LWP 7884) 0x00a1c424 in __kernel_vsyscall ()
  14 Thread 0xb67fdb70 (LWP 7883) 0x00a1c424 in __kernel_vsyscall ()
  13 Thread 0xb71feb70 (LWP 7882) 0x00a1c424 in __kernel_vsyscall ()
  12 Thread 0xb7bffb70 (LWP 7881) 0x00a1c424 in __kernel_vsyscall ()
  10 Thread 0x4f8db70 (LWP 7879) 0x00a1c424 in __kernel_vsyscall ()
  8 Thread 0x3b8bb70 (LWP 7877) 0x00a1c424 in __kernel_vsyscall ()
  7 Thread 0x251fb70 (LWP 7876) 0x00a1c424 in __kernel_vsyscall ()
* 6 Thread 0x1b1eb70 (LWP 7874) 0x080b33ef in COutputTS::run (this=0x9804758) at http://www.cnblogs.com/src/COutputTS.cpp:130
  5 Thread 0x318ab70 (LWP 7873) 0x00a1c424 in __kernel_vsyscall ()
  4 Thread 0x5b8db70 (LWP 7872) 0x080b33ef in COutputTS::run (this=0x9800568) at http://www.cnblogs.com/src/COutputTS.cpp:130
  3 Thread 0x71e8b70 (LWP 7871) 0x00a1c424 in __kernel_vsyscall ()
  1 Thread 0xb7fe1770 (LWP 7837) 0x00a1c424 in __kernel_vsyscall ()
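As the output shows, it was thread 6 (marked with the asterisk) that crashed when it reached "COutputTS.cpp:130". Thread 4 is stopped at the same line, with a different this pointer.
(3) The code in question is: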
// Write the data
if(m_pBuff->count() <= 0)
{
    LOG_PERIOD(LOG_TYPE_INFO, "Buffer Empty.Channel:[%d-%s], BufferCount:%d\n", m_pChannel->m_nChannelID, m_pChannel->m_strName.c_str(), m_pBuff->count());
    usleep(10);
    continue;
}
CPoolNode* node = m_pBuff->pop();
if (node != NULL)
{
    if(FileSave.Write(node->m_pData, node->m_nDataLen) < node->m_nDataLen)
    {
        //LOG(LOG_TYPE_ERROR, "%s", FileSave.GetError().c_str());
    }
    m_pBuff->freeNode(node);
}
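First, the logic:
m_pBuff is a queue buffer that stores object pointers; each node is in fact a node maintained by a separate memory pool.
This code takes an available node out of m_pBuff, writes its contents to a file, and then "frees" it.
Note: "freeing" here is not a real delete; the node is returned to the memory pool so it can be reused.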
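The ownership pattern described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the recorder's actual code: NodePool, acquire, release and available are hypothetical names, the payload size is arbitrary, and only CPoolNode, m_pData and m_nDataLen come from the snippet above. The sketch is also not thread-safe; a real pool shared between threads would need locking.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Field names mirror the snippet above; everything else is hypothetical.
struct CPoolNode {
    char        m_pData[188];   // payload storage (size chosen arbitrarily)
    std::size_t m_nDataLen = 0; // number of valid bytes in m_pData
};

// A fixed-size pool: nodes are allocated once and recycled. They are only
// destroyed when the pool itself is destroyed.
class NodePool {
public:
    explicit NodePool(std::size_t capacity) {
        for (std::size_t i = 0; i < capacity; ++i) {
            m_storage.push_back(std::make_unique<CPoolNode>());
            m_free.push_back(m_storage.back().get());
        }
    }

    // Hand out a free node, or nullptr when the pool is exhausted.
    CPoolNode* acquire() {
        if (m_free.empty()) return nullptr;
        CPoolNode* node = m_free.back();
        m_free.pop_back();
        return node;
    }

    // "Free" a node: no delete, just put it back on the free list.
    void release(CPoolNode* node) {
        assert(node != nullptr);
        node->m_nDataLen = 0;
        m_free.push_back(node);
    }

    std::size_t available() const { return m_free.size(); }

private:
    std::vector<std::unique_ptr<CPoolNode>> m_storage; // owns every node
    std::vector<CPoolNode*>                 m_free;    // nodes ready for reuse
};
```

With a design like this, the queue only ever carries pointers that came out of acquire(), which is why a value such as 0x73 popping out of the queue is a sign that something else has gone wrong.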
Now let's look at the scene of the crash:
Line 130 points to "if(FileSave.Write(node->m_pData, node->m_nDataLen) < node->m_nDataLen)", so let's inspect the variables involved:
(gdb) thread 6
[Switching to thread 6 (Thread 0x1b1eb70 (LWP 7874))]#0 0x080b33ef in COutputTS::run (this=0x9804758) at http://www.cnblogs.com/src/COutputTS.cpp:130
130 in http://www.cnblogs.com/src/COutputTS.cpp

(gdb) p node
$9 = (CPoolNode *) 0x73
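We can see that the variable node points to the memory address 0x73. That is a low address, which is presumably protected by the system and cannot be accessed by the program, so the program faults as soon as it gets here.
A few more words about low addresses:
The memory space of a typical Linux C program consists of the following parts:
- Text segment (.text). This holds the instructions the CPU executes. The text segment is shareable, so only one copy of the same code is kept in memory, and it is read-only, which prevents the program from modifying its own instructions by mistake.
- Initialized data segment (.data). This holds variables that need an explicit initial value, for example a global variable defined outside any function: int val=100. Note that both of the segments above are stored in the program's executable file; the kernel reads them from that file when exec is called to start the program.
- Uninitialized data segment (.bss). Before the program starts running, the kernel initializes the data in this segment to 0 or null pointers. An example is a global variable defined outside any function: int sum;
- Heap. This segment serves dynamic memory requests made by the program; the commonly used malloc family of functions and the new operators allocate from it.
- Stack. Local variables of functions, and temporaries produced during function calls, are stored in this segment.
None of these regions sits at the very bottom of the address space. The lowest pages are normally left unmapped, so dereferencing a pointer such as 0x73 triggers a segmentation fault.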
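To make the regions listed above concrete, here is a small sketch that samples one address from each of them. The names (RegionAddresses, sample_regions, is_low_address, print_regions) are all hypothetical, and the 4 KiB threshold is an assumption chosen for illustration; the exact addresses printed vary from run to run and from system to system.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

int g_initialized = 100; // initialized data segment (.data)
int g_uninitialized;     // uninitialized data segment (.bss), zero at startup

// One representative address from each region of the process.
struct RegionAddresses {
    std::uintptr_t text;  // a function
    std::uintptr_t data;  // an initialized global
    std::uintptr_t bss;   // an uninitialized global
    std::uintptr_t heap;  // a malloc'ed block
    std::uintptr_t stack; // a local variable
};

RegionAddresses sample_regions() {
    int local = 0;                 // lives on the stack
    void* block = std::malloc(16); // lives on the heap
    assert(block != nullptr);

    RegionAddresses r;
    r.text  = reinterpret_cast<std::uintptr_t>(&sample_regions);
    r.data  = reinterpret_cast<std::uintptr_t>(&g_initialized);
    r.bss   = reinterpret_cast<std::uintptr_t>(&g_uninitialized);
    r.heap  = reinterpret_cast<std::uintptr_t>(block);
    r.stack = reinterpret_cast<std::uintptr_t>(&local);

    std::free(block);
    return r;
}

// Assumed heuristic: anything inside the first 4 KiB page is treated as a
// "low address" that cannot belong to a valid object on a typical system.
bool is_low_address(std::uintptr_t address) {
    return address < 0x1000;
}

// Print the sampled addresses; the values differ between runs and systems.
void print_regions() {
    const RegionAddresses r = sample_regions();
    std::printf("text  %p\n", reinterpret_cast<void*>(r.text));
    std::printf("data  %p\n", reinterpret_cast<void*>(r.data));
    std::printf("bss   %p\n", reinterpret_cast<void*>(r.bss));
    std::printf("heap  %p\n", reinterpret_cast<void*>(r.heap));
    std::printf("stack %p\n", reinterpret_cast<void*>(r.stack));
}
```

Calling print_regions() from main shows five addresses that are all far above 0x73, while is_low_address(0x73) returns true. A check of this kind is only a rough sanity test for debugging, not a way to validate pointers in general.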
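Clearly, the pointer 0x73 that we popped from the queue m_pBuff is wrong, because there is no way we could have pushed such an address in the first place.
Fine, let's just print the whole of m_pBuff and take a look: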
(gdb) p m_pBuff[0]
$8 = {<CBufferManager<CPoolNode*>> = {_vptr.CBufferManager = 0x80cfad8, m_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 0,
__nusers = 0, {__spins = 0, __list = {__next = 0x0}}}, __size = '\000' <repeats 23 times>, __align = 0},
m_queue = std::queue wrapping: std::deque with -140 elements = {
0x0, 0x31313032, 0x2d31312d, 0x31203430, 0x37313a38, 0x2c35343a, 0x3a313930, 0x464e4920, 0x3a20204f, 0x63657220, 0x20646576, 0x72616c61, 0x315b3a6d, 0x2e302e30, 0x322e3036, 0x332d312d, 0x5d353039, 0x7079742c, 0x2c323a65, 0x61747320, 0x69547472, 0x313a656d, 0x34303233, 0x36383130, 0x72202c35, 0x54766365, 0x3a656d69, 0x30323331, 0x38313034, 0xa3536, 0x8c94f10, 0x8d, 0x73, 0x73, 0x0, 0x31313032, 0x2d31312d, 0x31203430, 0x37313a38, 0x2c36343a, 0x3a333131, 0x464e4920, 0x3a20204f, 0x63657220, 0x20646576, 0x72616c61, 0x315b3a6d, 0x2e302e30, 0x322e3036, 0x332d312d, 0x5d353039, 0x7079742c, 0x2c313a65, 0x61747320, 0x69547472, 0x313a656d, 0x34303233, 0x36383130, 0x72202c36, 0x54766365, 0x3a656d69, 0x30323331, 0x38313034, 0xa3636, 0x84d8320, 0x8d, 0x73, 0x73, 0x0, 0x31313032, 0x2d31312d, 0x31203430, 0x37313a38, 0x2c37343a, 0x3a333331, 0x464e4920, 0x3a20204f, 0x63657220, 0x20646576, 0x72616c61, 0x315b3a6d, 0x2e302e30, 0x322e3036, 0x332d312d, 0x5d353039, 0x7079742c, 0x2c323a65, 0x61747320, 0x69547472, 0x313a656d, 0x34303233, 0x36383130, 0x72202c37, 0x54766365, 0x3a656d69, 0x30323331, 0x38313034, 0xa3736, 0x8d9f9c8, 0x71, 0xb7d000a8, 0xb7dc4778, 0xffffffff, 0x76636572, 0x61206465, 0x6d72616c, 0x30315b3a, 0x362e302e, 0x2d322e30, 0x39332d31, 0x2c5d3530, 0x65707974, 0x202c323a, 0x72617473, 0x6d695474, 0x33313a65, 0x30343032, 0x39363831, 0x6572202c, 0x69547663, 0x313a656d, 0x34303233, 0x36383130, 0x8cf0039, 0x8e078d8, 0x8cc3fd8, 0xb7d6e2b8, 0x8c47608, 0xb7d18468, 0x8af6088, 0xb7d05680, 0x8346bd8, 0x89e6088, 0xb7d4de18, 0x89751b0, 0x8b61e98, 0x84beee0, 0x8ccb020, 0x8b87720, 0x89645f0, 0x8a642a8, 0x85181b0, 0x8a86bf0, 0x8155640, 0x834a598, 0x8880588, 0x8dafb08, 0x8d27770, 0x8ad0d40, 0x8bb2e28, 0x89c80c0, 0xb7d05710, 0x898f278, 0x8d40670, 0x8b770a0, 0x8a0b518, 0x8d34eb0, 0x8b7a520, 0x8836438, 0xb7d055d8, 0x835f598, 0x8cd8220, 0x8decf98, 0xb7d70cb8, 0xb7da2c90, 0x8824170, 0xb283d5e8, 0x8a93370, 0xb7d63dc0, 0x8d7c600, 0x8bf31f8, 0xb282e468, 0x8d77740, 0x8b2d490, 0x8aec100, 0xb2ad7b68, 
0x8b02d48, 0x868a1a8, 0x8890c08, 0x8b93960, 0xb7db1c30, 0x88270b0, 0x8b79560, 0xb2a06618, 0x88b7990, 0x8cefe28, 0xb7dfffb0, 0x8b46e10, 0x8b40510, 0x89864b8, 0x8857880, 0x8d26cf0, 0x8c4e448, 0x8b2cf50, 0x830f570, 0xb7da41a8, 0x8aa7af8, 0x8b9b760, 0x889a448, 0x8527db0...}}, m_pool = 0x8140388}
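Notice the problem?
A run of suspicious addresses has turned up in the queue m_pBuff:
"0x0, 0x31313032, 0x2d31312d, 0x31203430, 0x37313a38, 0x2c35343a, 0x3a313930"
Even more suspicious, there are invalid addresses such as "0x0" and "0xffffffff", and stranger still, the queue contains repeated elements.
In short, this block of memory looks very wrong.
The question is: how did it end up like this?
After checking the code that pushes nodes into m_pBuff, I confirmed that everything pushed into m_pBuff should be a valid memory address, because the program could not have obtained an address like "0x3a313930" and then pushed it into the queue.
I discussed this with my boss. He believes this memory was "stomped on" by some other part of the program, meaning the queue's contents were wrongly overwritten through some other pointer, and he suggested I look for the cause elsewhere in the program.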
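What "stomping" means here can be shown with a deliberately simplified model. The sketch below is an illustration only and makes no claim about where the recorder's bug actually is: a single byte array (Arena, a hypothetical name) holds a 16-byte text buffer immediately followed by four 32-bit slots that stand in for the queue's pointers. Copying more text than the buffer can hold spills into the slots. Everything stays inside the one array, so the demo itself is well defined; in a real program the same overrun would corrupt whatever object happened to sit next to the buffer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kTextSize  = 16; // bytes reserved for the text buffer
constexpr std::size_t kSlotCount = 4;  // 32-bit slots standing in for pointers

struct Arena {
    // Layout: [ text buffer: kTextSize bytes ][ slots: kSlotCount * 4 bytes ]
    std::array<unsigned char, kTextSize + kSlotCount * sizeof(std::uint32_t)> bytes{};

    void set_slot(std::size_t i, std::uint32_t value) {
        assert(i < kSlotCount);
        std::memcpy(bytes.data() + kTextSize + i * sizeof(value), &value, sizeof(value));
    }

    std::uint32_t slot(std::size_t i) const {
        assert(i < kSlotCount);
        std::uint32_t value = 0;
        std::memcpy(&value, bytes.data() + kTextSize + i * sizeof(value), sizeof(value));
        return value;
    }

    // Buggy writer: it trusts `len` and never checks it against kTextSize.
    // The length is clamped to the arena only so this demo stays well defined.
    void write_text_unchecked(const char* text, std::size_t len) {
        if (len > bytes.size()) len = bytes.size();
        std::memcpy(bytes.data(), text, len);
    }
};
```

If slots 0 and 1 are first set to plausible pointer values and write_text_unchecked is then called with the 24-character string "AAAAAAAAAAAAAAAABBBBCCCC", slot 0 reads back as 0x42424242 and slot 1 as 0x43434343, while slots 2 and 3 are untouched. The code that owns the slots never wrote those values, which is exactly why this kind of bug has to be hunted in the code that writes next door.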
But in such a vast body of code, where should I even start looking?
I was stuck.
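One possible starting point, offered as a suggestion rather than a step from the original investigation, is to check whether the bogus queue entries are printable text. The sketch below assumes a little-endian machine (the dump comes from a 32-bit x86 process) and uses a hypothetical helper, words_to_ascii, that turns 32-bit words into the characters their bytes spell.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Interpret each 32-bit word as four bytes in little-endian order (the byte
// order of x86) and append them as text. Non-printable bytes become '.'.
std::string words_to_ascii(const std::vector<std::uint32_t>& words) {
    std::string out;
    for (std::uint32_t w : words) {
        for (int shift = 0; shift < 32; shift += 8) {
            const unsigned char byte = static_cast<unsigned char>((w >> shift) & 0xFFu);
            const bool printable = (byte >= 0x20 && byte <= 0x7E);
            out.push_back(printable ? static_cast<char>(byte) : '.');
        }
    }
    return out;
}
```

Fed the six words 0x31313032, 0x2d31312d, 0x31203430, 0x37313a38, 0x2c35343a, 0x3a313930 from the dump, the helper returns "2011-11-04 18:17:45,091:", which reads like a timestamp. That is consistent with text having been written over the queue's storage, although on its own it does not say which code did the writing.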