一次虚拟内存排查经历

事情的起因是这样的,我最近在给silly增加prometheus数据库支持。

在测试过程中发现,在docker中,silly刚起动就占了将近110MiB虚拟内存。

我将相同的代码在宿主机直接直行,虚拟内存只有48.32MiB。

与此同时, silly暴露给prometheus数据库的指标显示,应用程序分配了3.7MiB内存,而jemalloc一共给应用程序分配了9.38 MiB内存。

我打算先来看看这48.32MiB内存是不是合理的。

通过cat /proc/[pid]/smaps查看了一下虚拟内存的大致分配,下面是一些比较大的匿名内存段。

7ffff587e000-7ffff607e000 rw-p 00000000 00:00 0
Size:               8192 kB
...
7ffff607f000-7ffff687f000 rw-p 00000000 00:00 0
Size:               8192 kB
...
7ffff6880000-7ffff7080000 rw-p 00000000 00:00 0
Size:               8192 kB
...
7ffff7080000-7ffff7c00000 rw-p 00000000 00:00 0
Size:              11776 kB

其中三个8192KiB内存块肯定是三个线程栈所分配的虚拟内存,一共24MiB。

而11776KiB这块内存大小和jemalloc向应用程序分配的内存大小非常接近(稍大一些), 有理由怀疑这块内存其实就是jemalloc所使用的虚拟内存大小,毕竟jemalloc本身的元数据也是需要一些内存消耗的。

为了确认这一怀疑,我使用strace -f -k -e mmap ./silly/silly server-src/config来确认这一点(事实上这一步我遇到了困境,因为我没有加-f标志,导致strace不能跟踪所有线程的系统调用,以致于浪费了很多时间)。

下面是截取的一段jemalloc相关的mmap系统调用,可以看到这块内存刚好就是jemalloc所用掉的。

mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7fcf000
 > /usr/lib/x86_64-linux-gnu/libc-2.28.so(mmap64+0x43) [0xf3cd3]
 > /home/silly/silly(je_pages_boot+0x1c7) [0xb0877]
 > /home/silly/silly(malloc_init_hard_a0_locked+0x12d) [0x4f20d]
 > /home/silly/silly(malloc_init_hard+0x79) [0x4f4d9]
 > /home/silly/silly(__libc_csu_init+0x45) [0xc3895]
 > /usr/lib/x86_64-linux-gnu/libc-2.28.so(__libc_start_main+0x7a) [0x2402a]
 > /home/silly/silly(_start+0x2a) [0x1dcea]
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ffff7fcf000
 > /usr/lib/x86_64-linux-gnu/libc-2.28.so(mmap64+0x43) [0xf3cd3]
 > /home/silly/silly(je_pages_boot+0x2b7) [0xb0967]
 > /home/silly/silly(malloc_init_hard_a0_locked+0x12d) [0x4f20d]
 > /home/silly/silly(malloc_init_hard+0x79) [0x4f4d9]
 > /home/silly/silly(__libc_csu_init+0x45) [0xc3895]
 > /usr/lib/x86_64-linux-gnu/libc-2.28.so(__libc_start_main+0x7a) [0x2402a]
 > /home/silly/silly(_start+0x2a) [0x1dcea]
mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ffff7a49000
 > /usr/lib/x86_64-linux-gnu/libc-2.28.so(mmap64+0x43) [0xf3cd3]
 > /home/silly/silly(je_pages_map+0x4c) [0xb00cc]
 > /home/silly/silly(je_extent_alloc_mmap+0x14) [0xa56c4]
 > /home/silly/silly(base_block_alloc.isra.21+0x281) [0x63681]
 > /home/silly/silly(je_base_new+0x70) [0x64030]
 > /home/silly/silly(je_base_boot+0x17) [0x64b97]
 > /home/silly/silly(malloc_init_hard_a0_locked+0x138) [0x4f218]
 > /home/silly/silly(malloc_init_hard+0x79) [0x4f4d9]
 > /home/silly/silly(__libc_csu_init+0x45) [0xc3895]
 > /usr/lib/x86_64-linux-gnu/libc-2.28.so(__libc_start_main+0x7a) [0x2402a]
 > /home/silly/silly(_start+0x2a) [0x1dcea]
mmap(NULL, 4190208, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ffff784a000
 > /usr/lib/x86_64-linux-gnu/libc-2.28.so(mmap64+0x43) [0xf3cd3]
 > /home/silly/silly(je_pages_map+0x163) [0xb01e3]
 > /home/silly/silly(je_extent_alloc_mmap+0x14) [0xa56c4]
 > /home/silly/silly(base_block_alloc.isra.21+0x281) [0x63681]
 > /home/silly/silly(je_base_new+0x70) [0x64030]
 > /home/silly/silly(je_base_boot+0x17) [0x64b97]
 > /home/silly/silly(malloc_init_hard_a0_locked+0x138) [0x4f218]
 > /home/silly/silly(malloc_init_hard+0x79) [0x4f4d9]
 > /home/silly/silly(__libc_csu_init+0x45) [0xc3895]
 > /usr/lib/x86_64-linux-gnu/libc-2.28.so(__libc_start_main+0x7a) [0x2402a]
 > /home/silly/silly(_start+0x2a) [0x1dcea]
mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ffff7800000
 > /usr/lib/x86_64-linux-gnu/libc-2.28.so(mmap64+0x43) [0xf3cd3]
 > /home/silly/silly(je_pages_map+0x4c) [0xb00cc]
 > /home/silly/silly(je_extent_alloc_mmap+0x14) [0xa56c4]
 > /home/silly/silly(je_ehooks_default_alloc_impl+0xd3) [0x9de03]
 > /home/silly/silly(je_ecache_alloc_grow+0x9b9) [0xa5039]
 > /home/silly/silly(pac_alloc_real+0x93) [0xaf3e3]
 > /home/silly/silly(pac_alloc_impl+0xe5) [0xaf525]
 > /home/silly/silly(je_pa_alloc+0x57) [0xae547]
 > /home/silly/silly(je_arena_extent_alloc_large+0x9f) [0x5b56f]
 > /home/silly/silly(je_large_palloc+0xcb) [0xac53b]
 > /home/silly/silly(je_arena_palloc+0xd5) [0x5ede5]
 > /home/silly/silly(je_tsd_tcache_data_init+0xe2) [0xbc2a2]
 > /home/silly/silly(je_tsd_tcache_enabled_data_init+0x28) [0xbc948]
 > /home/silly/silly(je_tsd_fetch_slow+0x114) [0xbe3d4]
 > /home/silly/silly(malloc_init_hard+0x9b) [0x4f4fb]
 > /home/silly/silly(__libc_csu_init+0x45) [0xc3895]
 > /usr/lib/x86_64-linux-gnu/libc-2.28.so(__libc_start_main+0x7a) [0x2402a]
 > /home/silly/silly(_start+0x2a) [0x1dcea]
mmap(NULL, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ffff7400000
 > /usr/lib/x86_64-linux-gnu/libc-2.28.so(mmap64+0x43) [0xf3cd3]
 > /home/silly/silly(je_pages_map+0x4c) [0xb00cc]
 > /home/silly/silly(je_extent_alloc_mmap+0x14) [0xa56c4]
 > /home/silly/silly(base_block_alloc.isra.21+0x281) [0x63681]
 > /home/silly/silly(je_base_alloc+0x269) [0x644d9]
 > /home/silly/silly(je_rtree_leaf_elm_lookup_hard+0x100) [0xb5280]
 > /home/silly/silly(je_emap_register_boundary+0x357) [0x9ea17]
 > /home/silly/silly(je_ecache_alloc_grow+0x60b) [0xa4c8b]
 > /home/silly/silly(pac_alloc_real+0x93) [0xaf3e3]
 > /home/silly/silly(pac_alloc_impl+0xe5) [0xaf525]
 > /home/silly/silly(je_pa_alloc+0x57) [0xae547]
 > /home/silly/silly(je_arena_extent_alloc_large+0x9f) [0x5b56f]
 > /home/silly/silly(je_large_palloc+0xcb) [0xac53b]
 > /home/silly/silly(je_arena_palloc+0xd5) [0x5ede5]
 > /home/silly/silly(je_tsd_tcache_data_init+0xe2) [0xbc2a2]
 > /home/silly/silly(je_tsd_tcache_enabled_data_init+0x28) [0xbc948]
 > /home/silly/silly(je_tsd_fetch_slow+0x114) [0xbe3d4]
 > /home/silly/silly(malloc_init_hard+0x9b) [0x4f4fb]
 > /home/silly/silly(__libc_csu_init+0x45) [0xc3895]
 > /usr/lib/x86_64-linux-gnu/libc-2.28.so(__libc_start_main+0x7a) [0x2402a]
 > /home/silly/silly(_start+0x2a) [0x1dcea]
mmap(NULL, 3670016, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ffff7080000
 > /usr/lib/x86_64-linux-gnu/libc-2.28.so(mmap64+0x43) [0xf3cd3]
 > /home/silly/silly(je_pages_map+0x4c) [0xb00cc]
 > /home/silly/silly(je_extent_alloc_mmap+0x14) [0xa56c4]
 > /home/silly/silly(je_ehooks_default_alloc_impl+0xd3) [0x9de03]
 > /home/silly/silly(je_ecache_alloc_grow+0x9b9) [0xa5039]
 > /home/silly/silly(pac_alloc_real+0x93) [0xaf3e3]
 > /home/silly/silly(pac_alloc_impl+0xe5) [0xaf525]
 > /home/silly/silly(je_pa_alloc+0x57) [0xae547]
 > /home/silly/silly(je_arena_extent_alloc_large+0x9f) [0x5b56f]
 > /home/silly/silly(je_large_palloc+0xcb) [0xac53b]
 > /home/silly/silly(je_arena_malloc_hard+0x50d) [0x5ec0d]
 > /home/silly/silly(je_malloc_default+0x5a1) [0x52371]
 > /home/silly/silly(silly_malloc+0x6) [0x23356]
 > /home/silly/silly(silly_socket_init+0x57) [0x215e7]
 > /home/silly/silly(silly_run+0x88) [0x22cb8]
 > /home/silly/silly(main+0x468) [0x1d888]
 > /usr/lib/x86_64-linux-gnu/libc-2.28.so(__libc_start_main+0xeb) [0x2409b]
 > /home/silly/silly(_start+0x2a) [0x1dcea]

有了宿主机的这些经验后,再去查看docker中那110MiB内存来源时就轻车熟路了。

首先使用docker run --rm --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -it --entrypoint /bin/bash registry.cn-hangzhou.aliyuncs.com/findstr-vps/xxx来强行改写entrypoint为bash和添加strace的能力,之后所有的分析都和宿主机一样了。

通过cat /proc/[pid]/smaps发现了一大块内存一个字节都没有使用,他很可疑.

7ffff0021000-7ffff4000000 ---p 00000000 00:00 0
Size:              65404 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                   0 kB
Pss:                   0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    1
VmFlags: mr mw me nr sd

再使用strace -f -k -e mmap ./silly/silly server-conf/config, 可以找到相应的callstack如下:

[pid    34] mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fffed87d000
 > /lib/x86_64-linux-gnu/libc-2.28.so(mmap64+0x43) [0xf3cd3]
 > /lib/x86_64-linux-gnu/libc-2.28.so(_IO_str_seekoff+0xd87) [0x80127]
 > /lib/x86_64-linux-gnu/libc-2.28.so(_IO_str_seekoff+0x157c) [0x8091c]
 > /lib/x86_64-linux-gnu/libc-2.28.so(_IO_str_seekoff+0x44ad) [0x8384d]
 > /lib/x86_64-linux-gnu/libc-2.28.so(__libc_malloc+0x116) [0x84626]
 > /lib/x86_64-linux-gnu/libc-2.28.so(fgets+0x1bb) [0x7022b]
 > /silly/silly(luaL_loadfilex+0x4d) [0x3a35d]
 > /silly/silly(silly_worker_start+0x19b) [0x2200b]
 > /silly/silly(thread_worker+0xb) [0x228fb]
 > /lib/x86_64-linux-gnu/libpthread-2.28.so(start_thread+0xf3) [0x7fa3]
 > /lib/x86_64-linux-gnu/libc-2.28.so(clone+0x3f) [0xf906f]

可以看到是libc中的fgets函数导致的,当我使用gdb断这个函数时,相同的系统调用又会出现在别的线程中调用的glibc函数中。

我怀疑这是glibc的一个bug。报着试试看的态度google了一下,没想到这是一个Feature(glibc 2.10+版本的多线程程序,glibc会预分配很多虚拟内存, 用来提高性能)。

但是奇怪的时,我的宿主机和docker的glibc的版本都是2.28, 宿主机却没有虚拟内存问题。

ldd (Debian GLIBC 2.28-10+deb10u2) 2.28
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

7月30号补充:

在分析过程中,我还发现一个奇怪的现象。

mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ffff7a49000 产出了块大内存,但是从smaps中发现,这块内存包含了文件/home/silly/silly/luaclib/sys.so的内存映射, 截取部分smaps如下:

7ffff7080000-7ffff7c00000 rw-p 00000000 00:00 0
Size:              11776 kB
7ffff7c23000-7ffff7c27000 r--p 00000000 08:00 253822                     /home/silly/silly/luaclib/sys.so
Size:                 16 kB
7ffff7c27000-7ffff7c3d000 r-xp 00004000 08:00 253822                     /home/silly/silly/luaclib/sys.so
Size:                 88 kB
7ffff7c3d000-7ffff7c40000 r--p 0001a000 08:00 253822                     /home/silly/silly/luaclib/sys.so
Size:                 12 kB
7ffff7c40000-7ffff7c41000 r--p 0001c000 08:00 253822                     /home/silly/silly/luaclib/sys.so
Size:                  4 kB
7ffff7c41000-7ffff7c42000 rw-p 0001d000 08:00 253822                     /home/silly/silly/luaclib/sys.so
Size:                  4 kB
7ffff7c42000-7ffff7c4e000 rw-p 00000000 00:00 0
Size:                 48 kB

理论上这是不可能的。我想了很久,觉得只有一种可能,那就是刚mmap之后,程序就使用munmap释放掉了这块地址空间。然后刚好这块地址空间就被OS拿来做sys.so文件的映射了。

为了验证这一猜想,我使用strace -f -e mmap,munmap ./silly/silly server-src/config确认了一下,果然就是这样,下面的strace的部分输出。

mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ffff7a49000
munmap(0x7ffff7a49000, 2097152)         = 0

jemalloc在mmap分配了一块内存后,随即就使用munmap给释放了。

这也给我一个提示, 当我们监控内存分配时,只监控mmap是不够用。至少需要同时监控mmap,munmap一起配合分析才行。如果有必要我们还需要监控brk系统调用。

ps. 由于linux会地址随机化,可能会导致多出来很多匿名内存段,不便于排查问题。可以使用echo 0 > /proc/sys/kernel/randomize_va_space来临时关闭内存地址随机化。

发表评论

thirty two − = twenty five