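When working in languages like C/C++ that have no automatic memory management, we pay close attention to how memory is used. Common memory problems include buffer overflows (on the heap or the stack), memory leaks, null pointer dereferences, and double frees.
When writing very memory-hungry programs, we also have to ask whether memory will run out; the pointer analysis I have recently been doing for static analysis, for example, consumes a lot of memory. Usually "memory" here means the heap, which is allocated and freed dynamically. When a heap allocation cannot be satisfied, the failure is detected and reported: the allocator returns an error (or throws, in C++), or on Linux the OOM (out of memory) killer steps in. Projects such as LLVM, which wrap memory allocation in their own layer, give even friendlier messages.
But what if it is the stack that runs out?
Running out of stack space is not an exotic problem, because the stack in question is simply the function call stack. We all know that a recursive function without a correct termination condition may recurse forever and eventually trip the system's protection mechanism. So what is the maximum recursion depth? It clearly depends on the function, because a function's local variables are also stored on the stack. Even a pointer variable is stored on the stack; only the location it points to may be elsewhere. In addition, under 32-bit x86 calling conventions such as cdecl, arguments are passed on the stack as well. The maximum recursion depth is therefore not a fixed number. The code and results shown below were produced in the default environment of Ubuntu 18.04: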
1 Introduction
1.1 Infinite recursion
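Here is an example of infinite recursion. The global variable cnt records the number of calls to add, which recursively sums from a toward infinity (ignoring integer overflow). add prints cnt and then increments it, which gives us the depth of the recursion. This does not capture the maximum stack depth exactly, because add also calls printf, which uses stack space too (although that space is released when printf returns). The program quickly dies with segmentation fault (core dumped); at that point cnt is 261788.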
// example1.c
#include <stdio.h>

unsigned long cnt = 0;   // number of calls to add so far

int add(int a)
{
    printf("add cnt: %ld\n", cnt++);
    return a + add(a + 1);   // no base case: recurses until the stack runs out
}

int main()
{
    printf("%d\n", add(0));
    return 0;
}
/* [Output:]
add cnt: 261788
[1] 31440 segmentation fault (core dumped) ./a.out
*/
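Compared with the previous example, the code below declares a 1 KB local[] variable inside add. Running it again also ends quickly in segmentation fault (core dumped), but this time cnt is 7817, far smaller than the 261788 above. The reason is that the 1 KB local[] array is a local variable, so every call to add takes up that much more stack than in the previous code, and the recursion depth is correspondingly much smaller. (A side note: this cannot be written as char *local = "Stack Bomb:";. With the array, the string is stored on the stack; with the pointer, it is stored in the read-only data segment, and the variable is not 1 KB in size either. Any good reference on C pointers covers the details.)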
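// example2.c
#include <stdio.h>

unsigned long cnt = 0;   // number of calls to add so far

int add(int a)
{
    char local[1024] = "Stack Bomb:";   // 1 KB array stored in every frame
    printf("%s add cnt: %ld\n", local, cnt++);
    return a + add(a + 1);   // no base case: recurses until the stack runs out
}

int main()
{
    printf("%d\n", add(0));
    return 0;
}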
/* [Output:]
Stack Bomb: add cnt: 7817
[1] 19529 segmentation fault (core dumped) ./a.out
*/
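The reasoning above rests on one premise: the stack size is fixed! For example, both programs must be given a stack of the same n KB at run time; otherwise comparing the two cnt values would be meaningless. On Linux this is indeed the case. Taking the default configuration of Ubuntu 18.04 as an example, the stack size limit can be checked with ulimit -s (-s: stack size (kbytes)); the default is 8192 KB, i.e. 8 MB. The two examples already hint at this, especially example2.c: a 1 KB local variable and a recursion depth of a little over 7800 match 8192 KB quite well.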
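So if this stack size can be adjusted, the cnt values from the two examples should increase accordingly. They do: after raising the limit to 16 MB with ulimit -s 16384 and rerunning example1.c, the printed cnt is 523834, close to double, as expected.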
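1.2 The call stack cannot grow without bound
In essence, then, the crash in a recursive program is nothing more than the system refusing to let the call stack grow without bound; it has nothing to do with which function is being called. A program with a very long call chain that also has large local variables along the way will end up in the same situation as infinite recursion.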
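1.3 Why it is hard to debug
This kind of core dump is hard to debug because the information in it is not consistent. The several I have run into so far all looked different: in gdb the fault turns out not to be a null pointer, and sometimes the address is even accessible, which is quite baffling. Below are the backtraces from the core dumps of example1 and example2 above with an 8 MB stack. In fact, the same program run under different stack size limits does not necessarily produce the same bt either.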
- example1.c - 8192kb stack size - gdb ./a.out core - bt
#0 0x00007efe6efae268 in _IO_new_file_write (
f=0x7efe6f30f760 <_IO_2_1_stdout_>, data=0x56345c59a260, n=16)
at fileops.c:1196
1196 fileops.c: No such file or directory.
(gdb) bt
#0 0x00007efe6efae268 in _IO_new_file_write (
f=0x7efe6f30f760 <_IO_2_1_stdout_>, data=0x56345c59a260, n=16)
at fileops.c:1196
#1 0x00007efe6efb0021 in new_do_write (to_do=16,
data=0x56345c59a260 "add cnt: 261824\n",
fp=0x7efe6f30f760 <_IO_2_1_stdout_>) at fileops.c:457
#2 _IO_new_do_write (fp=0x7efe6f30f760 <_IO_2_1_stdout_>,
data=0x56345c59a260 "add cnt: 261824\n", to_do=16) at fileops.c:433
#3 0x00007efe6efaeabd in _IO_new_file_xsputn (
f=0x7efe6f30f760 <_IO_2_1_stdout_>, data=<optimized out>, n=1)
at fileops.c:1266
#4 0x00007efe6ef7eaaa in _IO_vfprintf_internal (
s=0x7efe6f30f760 <_IO_2_1_stdout_>,
format=0x56345c091744 "add cnt: %ld\n", ap=ap@entry=0x7ffe3d2f0630)
at vfprintf.c:1674
#5 0x00007efe6ef88016 in __printf (format=<optimized out>) at printf.c:33
#6 0x000056345c09167b in add ()
#7 0x000056345c091688 in add ()
......
- example2.c - 8192kb stack size - gdb ./a.out core - bt
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __find_specmb (format=0x560aad235814 "%s add cnt: %ld\n")
at printf-parse.h:108
108 printf-parse.h: No such file or directory.
(gdb) bt
#0 __find_specmb (format=0x560aad235814 "%s add cnt: %ld\n")
at printf-parse.h:108
#1 _IO_vfprintf_internal (s=0x7f2d0fc51760 <_IO_2_1_stdout_>,
format=0x560aad235814 "%s add cnt: %ld\n", ap=ap@entry=0x7ffd91422530)
at vfprintf.c:1320
#2 0x00007f2d0f8ca016 in __printf (format=<optimized out>) at printf.c:33
#3 0x0000560aad23572e in add ()
#4 0x0000560aad23572e in add ()
......
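2 Discovering, locating, and fixing the problem
I first ran into this problem while writing a C++ program. After a segmentation fault, I opened the core file in gdb and frame 0 of bt showed malloc.c: No such file or directory; after I added some logging it changed to something like Cannot access memory at address 0x7ffd05543ff8. I ran the program several more times and found that the location and timing of the crash varied, yet there seemed to be a pattern. Many of my functions use STL containers such as set and map, and since this is a large program the containers naturally hold pointers. Note that although STL containers manage their elements on the heap by default, when a container is declared as a local variable in a function, the container object itself lives on the stack, just like the pointer variable discussed earlier. Back to the point: most of the time the bt ended in a container initialization or an insert. My first reaction was that the elements in the container (that is, the pointers) were bad, but some logging showed they were perfectly accessible, which was very strange.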
Program received signal SIGSEGV, Segmentation fault.
_int_malloc (source=0x7ffff7201740 <getNormalFlowFunction>, bytes=112) at malloc.c:3570
3570 malloc.c: No such file or directory.
(gdb) bt
#0 _int_malloc (source=0x7ffff7201740 <getNormalFlowFunction>, bytes=112) at malloc.c:3570
#1 0x00007ffff6ecbfb5 in __GI___libc_malloc (bytes=112) at malloc.c:2924
......
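On Stack Overflow I searched for malloc.c: No such file or directory. Many of the answers could not point directly at a problem in the code; the more reliable ones suggested running valgrind to check for memory leaks and similar problems. So I set up the environment and ran valgrind: valgrind --tool=memcheck --leak-check=full <exec>, which produced the following (excerpt):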
......
==6118== Stack overflow in thread #1: can't grow stack to 0x1ffe801000
==6118==
==6118== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==6118== Access not within mapped region at address 0x1FFE801EE8
==6118== Stack overflow in thread #1: can't grow stack to 0x1ffe801000
==6118== at 0x21A85CA: readAbbreviatedField(llvm::BitstreamCursor&, llvm::BitCodeAbbrevOp const&) (in <exec>)
==6118== If you believe this happened as a result of a stack
==6118== overflow in your program's main thread (unlikely but
==6118== possible), you can try to increase the size of the
==6118== main thread stack using the --main-stacksize= flag.
==6118== The main thread stack size used in this run was 8388608.
==6118== Stack overflow in thread #1: can't grow stack to 0x1ffe801000
==6118==
==6118== Process terminating with default action of signal 11 (SIGSEGV)
==6118== Access not within mapped region at address 0x1FFE801ED8
==6118== Stack overflow in thread #1: can't grow stack to 0x1ffe801000
==6118== at 0x4A2C650: _vgnU_freeres (in /usr/lib/valgrind/vgpreload_core-amd64-linux.so)
==6118== If you believe this happened as a result of a stack
==6118== overflow in your program's main thread (unlikely but
==6118== possible), you can try to increase the size of the
==6118== main thread stack using the --main-stacksize= flag.
==6118== The main thread stack size used in this run was 8388608.
==6118==
==6118== HEAP SUMMARY:
==6118== in use at exit: 2,530,649,161 bytes in 11,931,838 blocks
==6118== total heap usage: 73,308,707 allocs, 61,376,869 frees, 64,658,848,337 bytes allocated
==6118==
==6118== 64 bytes in 1 blocks are possibly lost in loss record 556 of 2,549
==6118== at 0x4C31B0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==6118== by 0x71A630: llvm::safe_malloc(unsigned long) (MemAlloc.h:26)
==6118== by 0x739DC1: llvm::SmallVectorTemplateBase<std::pair<void*, unsigned long>, false>::grow(unsigned long) (SmallVector.h:240)
==6118== by 0x7317CF: llvm::SmallVectorTemplateBase<std::pair<void*, unsigned long>, false>::push_back(std::pair<void*, unsigned long>&&) (SmallVector.h:220)
==6118== by 0x72B1EC: llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4096ul>::Allocate(unsigned long, unsigned long) (Allocator.h:249)
==6118== by 0x7A1ACE: llvm::AllocatorBase<llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4096ul> >::Allocate(unsigned long, unsigned long) (Allocator.h:59)
==6118== by 0x7AB5F5: clang::CFGBlock** llvm::AllocatorBase<llvm::BumpPtrAllocatorImpl<llvm::MallocAllocator, 4096ul, 4096ul> >::Allocate<clang::CFGBlock*>(unsigned long) (Allocator.h:81)
==6118== by 0x7A6344: clang::BumpVector<clang::CFGBlock*>::grow(clang::BumpVectorContext&, unsigned long) (BumpVector.h:233)
==6118== by 0x79FBBB: clang::BumpVector<clang::CFGBlock*>::push_back(clang::CFGBlock* const&, clang::BumpVectorContext&) (BumpVector.h:166)
==6118== by 0x78414B: clang::CFG::createBlock() (CFG.cpp:4804)
==6118== by 0x77866E: (anonymous namespace)::CFGBuilder::createBlock(bool) (CFG.cpp:1544)
==6118== by 0x780042: (anonymous namespace)::CFGBuilder::VisitWhileStmt(clang::WhileStmt*) (CFG.cpp:3665)
==6118==
==6118== 64 bytes in 1 blocks are possibly lost in loss record 557 of 2,549
==6118== at 0x4C31B0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
.....
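Let's analyze this. Most of the later part of the log looks like heap memory leaks, but that is only because the program core dumped and the dynamically allocated memory was never released, so it is of little value. The key is at the very start of the log, which reports a stack overflow. The line that made it click for me was Stack overflow in thread #1: can't grow stack to 0x1ffe801000: the stack cannot grow any further. A bit of searching led to the analysis in the first part of this post. I then tried raising the stack size, and the program ran normally.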
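To further confirm that the problem was caused by insufficient stack size, I decided to profile the program and look at its peak memory usage, the stack peak in particular. valgrind can do this too: valgrind --tool=massif --stacks=yes <exec>. It produces a massif.* file, and ms_print massif.*file prints the statistics. The stack usage of my program peaked at almost 9 MB (mem_stacks_B=8992984 in the snapshot below), which is above the 8388608-byte main thread stack size reported in the valgrind log earlier.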
#-----------
snapshot=26
#-----------
time=3341115043485
mem_heap_B=2796240754
mem_heap_extra_B=234456806
mem_stacks_B=8992984
heap_tree=empty
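At this point it is certain that the segmentation fault was caused by a stack overflow: the program simply ran out of stack space.
Appendix: No such file or directory
The logs above show that although the backtraces differ, frame 0 always seems to be inside glibc code such as malloc.c. The most likely reason is simply that a library call happened to be the deepest frame at the moment the stack limit was crossed, not that libc catches the error itself. If you want to debug into these libraries but gdb reports No such file or directory, the solution is to install the sources and tell gdb where they are (for example with gdb's directory command). Taking malloc.c: No such file or directory as an example: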
sudo apt-get install libc6-dbg
sudo apt install glibc-source
cd /usr/src/glibc
# ls
sudo tar xvf glibc-2.27.tar.xz
# ls
cd glibc-2.27/malloc
# ls
# vim malloc.c
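Selected references
Leokie: Setting the stack size of a C++ program to fix stack overflow
An overview of C/C++ debugging, tracing, and performance analysis tools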
Stack Overflow: malloc causing segmentation fault by _int_malloc
Stack Overflow: Valgrind reporting possible stack overflow - what does it mean?
Stack Overflow: Include source code of malloc.c in gdb?