事情的起因是这样的,我最近在给silly增加prometheus数据库支持。
在测试过程中发现,在docker中,silly刚起动就占了将近110MiB虚拟内存。
我将相同的代码在宿主机直接直行,虚拟内存只有48.32MiB。
与此同时, silly暴露给prometheus数据库的指标显示,应用程序分配了3.7MiB内存,而jemalloc一共给应用程序分配了9.38 MiB内存。
我打算先来看看这48.32MiB内存是不是合理的。
通过cat /proc/[pid]/smaps
查看了一下虚拟内存的大致分配,下面是一些比较大的匿名内存段。
7ffff587e000-7ffff607e000 rw-p 00000000 00:00 0
Size: 8192 kB
...
7ffff607f000-7ffff687f000 rw-p 00000000 00:00 0
Size: 8192 kB
...
7ffff6880000-7ffff7080000 rw-p 00000000 00:00 0
Size: 8192 kB
...
7ffff7080000-7ffff7c00000 rw-p 00000000 00:00 0
Size: 11776 kB
其中三个8192KiB内存块肯定是三个线程栈所分配的虚拟内存,一共24MiB。
而11776KiB这块内存大小和jemalloc向应用程序分配的内存大小非常接近(稍大一些), 有理由怀疑这块内存其实就是jemalloc所使用的虚拟内存大小,毕竟jemalloc本身的元数据也是需要一些内存消耗的。
为了确认这一怀疑,我使用strace -f -k -e mmap ./silly/silly server-src/config
来确认这一点(事实上这一步我遇到了困境,因为我没有加-f标志,导致strace不能跟踪所有线程的系统调用,以致于浪费了很多时间)。
下面是截取的一段jemalloc相关的mmap系统调用,可以看到这块内存刚好就是jemalloc所用掉的。
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7fcf000
> /usr/lib/x86_64-linux-gnu/libc-2.28.so(mmap64+0x43) [0xf3cd3]
> /home/silly/silly(je_pages_boot+0x1c7) [0xb0877]
> /home/silly/silly(malloc_init_hard_a0_locked+0x12d) [0x4f20d]
> /home/silly/silly(malloc_init_hard+0x79) [0x4f4d9]
> /home/silly/silly(__libc_csu_init+0x45) [0xc3895]
> /usr/lib/x86_64-linux-gnu/libc-2.28.so(__libc_start_main+0x7a) [0x2402a]
> /home/silly/silly(_start+0x2a) [0x1dcea]
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ffff7fcf000
> /usr/lib/x86_64-linux-gnu/libc-2.28.so(mmap64+0x43) [0xf3cd3]
> /home/silly/silly(je_pages_boot+0x2b7) [0xb0967]
> /home/silly/silly(malloc_init_hard_a0_locked+0x12d) [0x4f20d]
> /home/silly/silly(malloc_init_hard+0x79) [0x4f4d9]
> /home/silly/silly(__libc_csu_init+0x45) [0xc3895]
> /usr/lib/x86_64-linux-gnu/libc-2.28.so(__libc_start_main+0x7a) [0x2402a]
> /home/silly/silly(_start+0x2a) [0x1dcea]
mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ffff7a49000
> /usr/lib/x86_64-linux-gnu/libc-2.28.so(mmap64+0x43) [0xf3cd3]
> /home/silly/silly(je_pages_map+0x4c) [0xb00cc]
> /home/silly/silly(je_extent_alloc_mmap+0x14) [0xa56c4]
> /home/silly/silly(base_block_alloc.isra.21+0x281) [0x63681]
> /home/silly/silly(je_base_new+0x70) [0x64030]
> /home/silly/silly(je_base_boot+0x17) [0x64b97]
> /home/silly/silly(malloc_init_hard_a0_locked+0x138) [0x4f218]
> /home/silly/silly(malloc_init_hard+0x79) [0x4f4d9]
> /home/silly/silly(__libc_csu_init+0x45) [0xc3895]
> /usr/lib/x86_64-linux-gnu/libc-2.28.so(__libc_start_main+0x7a) [0x2402a]
> /home/silly/silly(_start+0x2a) [0x1dcea]
mmap(NULL, 4190208, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ffff784a000
> /usr/lib/x86_64-linux-gnu/libc-2.28.so(mmap64+0x43) [0xf3cd3]
> /home/silly/silly(je_pages_map+0x163) [0xb01e3]
> /home/silly/silly(je_extent_alloc_mmap+0x14) [0xa56c4]
> /home/silly/silly(base_block_alloc.isra.21+0x281) [0x63681]
> /home/silly/silly(je_base_new+0x70) [0x64030]
> /home/silly/silly(je_base_boot+0x17) [0x64b97]
> /home/silly/silly(malloc_init_hard_a0_locked+0x138) [0x4f218]
> /home/silly/silly(malloc_init_hard+0x79) [0x4f4d9]
> /home/silly/silly(__libc_csu_init+0x45) [0xc3895]
> /usr/lib/x86_64-linux-gnu/libc-2.28.so(__libc_start_main+0x7a) [0x2402a]
> /home/silly/silly(_start+0x2a) [0x1dcea]
mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ffff7800000
> /usr/lib/x86_64-linux-gnu/libc-2.28.so(mmap64+0x43) [0xf3cd3]
> /home/silly/silly(je_pages_map+0x4c) [0xb00cc]
> /home/silly/silly(je_extent_alloc_mmap+0x14) [0xa56c4]
> /home/silly/silly(je_ehooks_default_alloc_impl+0xd3) [0x9de03]
> /home/silly/silly(je_ecache_alloc_grow+0x9b9) [0xa5039]
> /home/silly/silly(pac_alloc_real+0x93) [0xaf3e3]
> /home/silly/silly(pac_alloc_impl+0xe5) [0xaf525]
> /home/silly/silly(je_pa_alloc+0x57) [0xae547]
> /home/silly/silly(je_arena_extent_alloc_large+0x9f) [0x5b56f]
> /home/silly/silly(je_large_palloc+0xcb) [0xac53b]
> /home/silly/silly(je_arena_palloc+0xd5) [0x5ede5]
> /home/silly/silly(je_tsd_tcache_data_init+0xe2) [0xbc2a2]
> /home/silly/silly(je_tsd_tcache_enabled_data_init+0x28) [0xbc948]
> /home/silly/silly(je_tsd_fetch_slow+0x114) [0xbe3d4]
> /home/silly/silly(malloc_init_hard+0x9b) [0x4f4fb]
> /home/silly/silly(__libc_csu_init+0x45) [0xc3895]
> /usr/lib/x86_64-linux-gnu/libc-2.28.so(__libc_start_main+0x7a) [0x2402a]
> /home/silly/silly(_start+0x2a) [0x1dcea]
mmap(NULL, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ffff7400000
> /usr/lib/x86_64-linux-gnu/libc-2.28.so(mmap64+0x43) [0xf3cd3]
> /home/silly/silly(je_pages_map+0x4c) [0xb00cc]
> /home/silly/silly(je_extent_alloc_mmap+0x14) [0xa56c4]
> /home/silly/silly(base_block_alloc.isra.21+0x281) [0x63681]
> /home/silly/silly(je_base_alloc+0x269) [0x644d9]
> /home/silly/silly(je_rtree_leaf_elm_lookup_hard+0x100) [0xb5280]
> /home/silly/silly(je_emap_register_boundary+0x357) [0x9ea17]
> /home/silly/silly(je_ecache_alloc_grow+0x60b) [0xa4c8b]
> /home/silly/silly(pac_alloc_real+0x93) [0xaf3e3]
> /home/silly/silly(pac_alloc_impl+0xe5) [0xaf525]
> /home/silly/silly(je_pa_alloc+0x57) [0xae547]
> /home/silly/silly(je_arena_extent_alloc_large+0x9f) [0x5b56f]
> /home/silly/silly(je_large_palloc+0xcb) [0xac53b]
> /home/silly/silly(je_arena_palloc+0xd5) [0x5ede5]
> /home/silly/silly(je_tsd_tcache_data_init+0xe2) [0xbc2a2]
> /home/silly/silly(je_tsd_tcache_enabled_data_init+0x28) [0xbc948]
> /home/silly/silly(je_tsd_fetch_slow+0x114) [0xbe3d4]
> /home/silly/silly(malloc_init_hard+0x9b) [0x4f4fb]
> /home/silly/silly(__libc_csu_init+0x45) [0xc3895]
> /usr/lib/x86_64-linux-gnu/libc-2.28.so(__libc_start_main+0x7a) [0x2402a]
> /home/silly/silly(_start+0x2a) [0x1dcea]
mmap(NULL, 3670016, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ffff7080000
> /usr/lib/x86_64-linux-gnu/libc-2.28.so(mmap64+0x43) [0xf3cd3]
> /home/silly/silly(je_pages_map+0x4c) [0xb00cc]
> /home/silly/silly(je_extent_alloc_mmap+0x14) [0xa56c4]
> /home/silly/silly(je_ehooks_default_alloc_impl+0xd3) [0x9de03]
> /home/silly/silly(je_ecache_alloc_grow+0x9b9) [0xa5039]
> /home/silly/silly(pac_alloc_real+0x93) [0xaf3e3]
> /home/silly/silly(pac_alloc_impl+0xe5) [0xaf525]
> /home/silly/silly(je_pa_alloc+0x57) [0xae547]
> /home/silly/silly(je_arena_extent_alloc_large+0x9f) [0x5b56f]
> /home/silly/silly(je_large_palloc+0xcb) [0xac53b]
> /home/silly/silly(je_arena_malloc_hard+0x50d) [0x5ec0d]
> /home/silly/silly(je_malloc_default+0x5a1) [0x52371]
> /home/silly/silly(silly_malloc+0x6) [0x23356]
> /home/silly/silly(silly_socket_init+0x57) [0x215e7]
> /home/silly/silly(silly_run+0x88) [0x22cb8]
> /home/silly/silly(main+0x468) [0x1d888]
> /usr/lib/x86_64-linux-gnu/libc-2.28.so(__libc_start_main+0xeb) [0x2409b]
> /home/silly/silly(_start+0x2a) [0x1dcea]
有了宿主机的这些经验后,再去查看docker中那110MiB内存来源时就轻车熟路了。
首先使用docker run --rm --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -it --entrypoint /bin/bash registry.cn-hangzhou.aliyuncs.com/findstr-vps/xxx
来强行改写entrypoint为bash和添加strace的能力,之后所有的分析都和宿主机一样了。
通过cat /proc/[pid]/smaps
发现了一大块内存一个字节都没有使用,他很可疑.
7ffff0021000-7ffff4000000 ---p 00000000 00:00 0
Size: 65404 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 0 kB
Pss: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
THPeligible: 1
VmFlags: mr mw me nr sd
再使用strace -f -k -e mmap ./silly/silly server-conf/config
, 可以找到相应的callstack如下:
[pid 34] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fffed87d000
> /lib/x86_64-linux-gnu/libc-2.28.so(mmap64+0x43) [0xf3cd3]
> /lib/x86_64-linux-gnu/libc-2.28.so(_IO_str_seekoff+0xd87) [0x80127]
> /lib/x86_64-linux-gnu/libc-2.28.so(_IO_str_seekoff+0x157c) [0x8091c]
> /lib/x86_64-linux-gnu/libc-2.28.so(_IO_str_seekoff+0x44ad) [0x8384d]
> /lib/x86_64-linux-gnu/libc-2.28.so(__libc_malloc+0x116) [0x84626]
> /lib/x86_64-linux-gnu/libc-2.28.so(fgets+0x1bb) [0x7022b]
> /silly/silly(luaL_loadfilex+0x4d) [0x3a35d]
> /silly/silly(silly_worker_start+0x19b) [0x2200b]
> /silly/silly(thread_worker+0xb) [0x228fb]
> /lib/x86_64-linux-gnu/libpthread-2.28.so(start_thread+0xf3) [0x7fa3]
> /lib/x86_64-linux-gnu/libc-2.28.so(clone+0x3f) [0xf906f]
可以看到是libc中的fgets函数导致的,当我使用gdb断这个函数时,相同的系统调用又会出现在别的线程中调用的glibc函数中。
我怀疑这是glibc的一个bug。报着试试看的态度google了一下,没想到这是一个Feature(glibc 2.10+版本的多线程程序,glibc会预分配很多虚拟内存, 用来提高性能)。
但是奇怪的时,我的宿主机和docker的glibc的版本都是2.28, 宿主机却没有虚拟内存问题。
ldd (Debian GLIBC 2.28-10+deb10u2) 2.28
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
7月30号补充:
在分析过程中,我还发现一个奇怪的现象。
mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ffff7a49000
产出了块大内存,但是从smaps中发现,这块内存包含了文件/home/silly/silly/luaclib/sys.so
的内存映射, 截取部分smaps如下:
7ffff7080000-7ffff7c00000 rw-p 00000000 00:00 0
Size: 11776 kB
7ffff7c23000-7ffff7c27000 r--p 00000000 08:00 253822 /home/silly/silly/luaclib/sys.so
Size: 16 kB
7ffff7c27000-7ffff7c3d000 r-xp 00004000 08:00 253822 /home/silly/silly/luaclib/sys.so
Size: 88 kB
7ffff7c3d000-7ffff7c40000 r--p 0001a000 08:00 253822 /home/silly/silly/luaclib/sys.so
Size: 12 kB
7ffff7c40000-7ffff7c41000 r--p 0001c000 08:00 253822 /home/silly/silly/luaclib/sys.so
Size: 4 kB
7ffff7c41000-7ffff7c42000 rw-p 0001d000 08:00 253822 /home/silly/silly/luaclib/sys.so
Size: 4 kB
7ffff7c42000-7ffff7c4e000 rw-p 00000000 00:00 0
Size: 48 kB
理论上这是不可能的。我想了很久,觉得只有一种可能,那就是刚mmap之后,程序就使用munmap释放掉了这块地址空间。然后刚好这块地址空间就被OS拿来做sys.so文件的映射了。
为了验证这一猜想,我使用strace -f -e mmap,munmap ./silly/silly server-src/config
确认了一下,果然就是这样,下面的strace的部分输出。
mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ffff7a49000
munmap(0x7ffff7a49000, 2097152) = 0
jemalloc在mmap分配了一块内存后,随即就使用munmap给释放了。
这也给我一个提示, 当我们监控内存分配时,只监控mmap是不够用。至少需要同时监控mmap,munmap一起配合分析才行。如果有必要我们还需要监控brk系统调用。
ps. 由于linux会地址随机化,可能会导致多出来很多匿名内存段,不便于排查问题。可以使用echo 0 > /proc/sys/kernel/randomize_va_space
来临时关闭内存地址随机化。