RHEL6环境下透明大页和Hadoop高CPU sys

同学反馈hadoop运行在虚拟机中,出现非常异常的负载过高问题。一旦任务开始运行,系统立即卡得无法操作。

登录检查top,发现任务运行时,CPU主要消耗在system而不是计算的use,这说明存在什么程序异常,导致不断在内核态消耗系统资源,并没有正常在工作计算。

top - 15:52:28 up 1 day,  5:05,  7 users,  load average: 16.69, 5.96, 2.19
Tasks: 181 total,  16 running, 165 sleeping,   0 stopped,   0 zombie
Cpu0  :  5.9%us, 93.8%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
Cpu1  :  3.6%us, 96.4%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  4.6%us, 95.4%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.3%us, 99.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8058056k total,  7476536k used,   581520k free,    37996k buffers
Swap:        0k total,        0k used,        0k free,  3225056k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  GROUP    P   TIME COMMAND
 2570 hadoop    20   0 4453m 648m  25m S  0.3  8.2  53:12.97 hadoop   2  53:12 java
 1843 hdfs      20   0 2767m 435m  28m S  0.0  5.5  16:52.01 hadoop   0  16:52 java
 2514 hadoop    20   0 2357m 424m  51m S  0.0  5.4   2:08.02 hadoop   3   2:08 java
 1748 hdfs      20   0 3899m 421m  28m S 18.7  5.4  89:12.91 hadoop   0  89:12 java
 2707 root      20   0 4603m 398m  52m S  0.6  5.1   2:51.18 root     1   2:51 java
 2490 hadoop    20   0 2261m 359m  50m S  0.0  4.6   2:37.18 hadoop   3   2:37 java
 1942 hadoop    20   0 4141m 344m  28m S  0.6  4.4  14:30.26 hadoop   3  14:30 java
 2940 hadoop    20   0 2819m 282m  28m S  0.3  3.6   7:10.89 hadoop   3   7:10 java
 4966 root      20   0 3686m 219m  14m S 14.1  2.8   0:51.95 root     1   0:51 java
19706 root      20   0 3027m 164m  13m S  0.3  2.1   2:09.46 root     1   2:09 java
 3097 hadoop    20   0 2746m 156m  27m S  0.0  2.0   1:27.23 hadoop   3   1:27 java
 5830 hadoop    20   0 2198m 107m  27m S 33.7  1.4   1:09.69 hadoop   0   1:09 java
27060 root      20   0  303m  75m 5200 S 18.4  1.0   7:05.99 root     3   7:05 ilogtail
 3043 hue       20   0 2710m  73m 5664 S  0.0  0.9   0:14.46 hue      1   0:14 hue
26082 ganglia   20   0  286m  37m 2556 R 17.8  0.5  68:19.41 ganglia  2  68:19 gmond
 1364 mysql     20   0  626m  28m 4344 S  0.0  0.4   0:31.21 mysql    3   0:31 mysqld
 2688 root      20   0  352m  14m 3440 S  0.3  0.2   0:51.18 root     0   0:51 supervisor
 3268 root      20   0  276m 9716 4940 S  0.0  0.1   0:03.76 root     3   0:03 httpd
 3170 ganglia   20   0  227m 9260 1348 S 19.3  0.1  41:04.32 ganglia  2  41:04 gmetad
 1438 root      20   0 81940 8952 5288 S  0.3  0.1   7:22.85 root     0   7:22 AliHids
27105 root      20   0  172m 7052 3084 S  0.0  0.1   5:01.65 root     0   5:01 python

上述任务运行起来时候,发现系统执行任何检查内存状态命令都很缓慢。例如vmstat时候,卡了好一会才开始有输出。但是vmstat输出可以看到,并没有很高的上下文切换

[root@emr-header-1 ~]# vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
13  0      0 587056  37908 3225036    0    0    30    36   57    7  2 25 73  0  0
12  0      0 587056  37908 3225036    0    0     0     0 6229 2902  4 96  0  0  0
12  0      0 586932  37908 3225036    0    0     0     0 4666 2147  5 95  0  0  0
14  0      0 587064  37916 3225028    0    0     0    32 4696 2225  4 96  0  0  0
13  0      0 587064  37916 3225036    0    0     0    92 4700 2252  4 96  0  0  0
13  0      0 586940  37916 3225036    0    0     0     0 4639 2158  4 97  0  0  0
14  0      0 586924  37916 3225036    0    0     0     0 4656 2221  2 99  0  0  0
17  0      0 586924  37916 3225036    0    0     0     0 4578 2216  4 96  0  0  0
18  0      0 586800  37916 3225036    0    0     0     4 4592 2229  4 96  0  0  0
18  0      0 586800  37920 3225036    0    0     0    40 4621 2237  5 95  0  0  0

尝试用ps抓运行在队列中任务也不行,卡住没有任何输出。检查发现ps命令无法使用

[root@emr-header-1 ~]# ps -o comm,pid,ppid,user,time,etime,start,pcpu,state --sort=comm aH | grep '^COMMAND\|R$'

原因是procfs无法访问,尝试cdproc中检查进程状态,发现也不能进入,长时间无输出。

[root@emr-header-1 ~]# cd /proc/2570

google了一下,原来这个问题是RedHat Enterprise Linux 6的透明大页的bug Performance Issues with Transparent Huge Pages (THP) 。详细的排查方法,可以参考 Linux 6 Transparent Huge Pages and Hadoop Workloads

尝试关闭透明大页

检查当前THP状态,可以看到是始终打开的,我们现在关闭掉

[root@emr-header-1 proc]# cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
[always] madvise never

[root@emr-header-1 proc]# echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

[root@emr-header-1 proc]# cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
always madvise [never]
[root@emr-header-1 2570]# cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
[always] madvise never

关闭

[root@emr-header-1 2570]# echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
[root@emr-header-1 2570]# cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
always madvise [never]

然后重启emr-agent

[root@emr-header-1 2570]# service emr-agent restart
Stopping emr-agent:                                        [  OK  ]
Starting emr-agent:                                        [  OK  ]

此时再次观察,发现任务的还是没有解决问题

top - 17:47:40 up 1 day,  7:01,  8 users,  load average: 14.30, 12.24, 8.06
Tasks: 170 total,   8 running, 162 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.7%us, 99.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu1  :  0.7%us, 99.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.3%us, 99.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  2.0%us, 98.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8058056k total,  7686156k used,   371900k free,    41568k buffers
Swap:        0k total,        0k used,        0k free,  3420192k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
26082 ganglia   20   0  288m  39m 2556 R 55.1  0.5  81:53.35 gmond
15053 hadoop    20   0 4553m 121m  24m S 53.4  1.5   6:36.45 java
20001 root      20   0  297m  70m 5184 S 31.5  0.9   6:56.79 ilogtail
15788 root      20   0 56796  520  364 R 30.9  0.0   0:09.23 rrdtool
18497 root      20   0 3695m 266m  15m S 28.9  3.4   8:45.15 java
15798 root      20   0   112    4    0 R 28.5  0.0   0:01.03 env
15768 root      20   0 15036 1176  840 R 25.6  0.0   0:23.93 top
15787 root      20   0 16924 1228  848 R 25.6  0.0   0:09.62 sampler.py
15797 root      20   0  1180   92   72 R 25.6  0.0   0:01.56 sh
15775 root      20   0  103m 1564 1120 S 21.6  0.0   0:11.87 service
 1748 hdfs      20   0 3899m 383m  28m S 20.6  4.9  96:18.13 java
11108 root      20   0  312m 4216 1660 S  6.0  0.1  20:09.36 aegis_quartz
13983 root      20   0 98328 4160 3156 S  6.0  0.1   3:57.80 sshd
  283 root      20   0     0    0    0 S  3.3  0.0   8:37.16 jbd2/xvda1-8
 2424 root      20   0 98324 4164 3156 S  3.3  0.1   1:37.10 sshd
 3170 ganglia   20   0  227m 9032 1348 S  3.0  0.1  46:59.56 gmetad

这个问题想尝试通过扩容内存和CPU来排查,采用XEN guest操作系统动态添加/移除内存或CPU为虚拟机扩容内存,从原先8G扩容到16G

不过,这个方法还没有做,先看看是哪个占用了sys

查找sys占用较高的进程

使用了一个比较土的方法,先把系统中所有正在运行的进程找出来

ps aux | awk '{print $2}' | tee run_pid

然后对每个进程使用pidstat命令,这样就可以看出是使用多少system

for i in `cat run_pid`;do pidstat -p $i;done | tee run_pid_cpu

此时能够得到的信息类似

Linux 2.6.32-431.23.3.el6.x86_64 (emr-header-1.cluster-25364)   12/19/2016      _x86_64_        (4 CPU)

04:41:59 PM       PID    %usr %system  %guest    %CPU   CPU  Command
04:41:59 PM     14541    0.66    1.87    0.00    2.53     1  java
Linux 2.6.32-431.23.3.el6.x86_64 (emr-header-1.cluster-25364)   12/19/2016      _x86_64_        (4 CPU)

04:41:59 PM       PID    %usr %system  %guest    %CPU   CPU  Command
04:41:59 PM     17981    0.15   13.68    0.00   13.83     1  gmond

现在按照第5列(%system)排序进程

cat run_pid_cpu | grep -v PID | grep -v 2.6.32-431.23.3.el6 | grep -v "^$" | sort -k 5

需要按照第4列(%system)排序就可以知道是哪个进程异常 (参考 Sort file according to the second column?)

    -k, --key=POS1[,POS2]

    start a key at POS1, end it at POS2 (origin 1)

    POS is F[.C][OPTS], where F is the field number and C the character position in the field. OPTS is one or more single-letter ordering options, which override global ordering options for that key. If no key is given, use the entire line as the key.
    -t, --field-separator=SEP

    use SEP instead of non-blank to blank transition

可以看到占用处理器资源的情况主要集中在如下进程(第5列最大的值在最下面),其中异常的是gmond,占用了大量的system,但是缺没有us

04:41:59 PM      1629    0.01    1.27    0.00    1.28     3  bash
04:41:59 PM     10114    0.03    1.31    0.00    1.34     0  aegis_quartz
04:41:59 PM      1846    0.16    1.84    0.00    2.01     0  java
04:41:59 PM     14541    0.66    1.87    0.00    2.53     1  java
04:41:59 PM      1456    0.19    3.11    0.00    3.30     2  AliYunDun
04:41:59 PM      2599   17.73    3.67    0.00   21.40     3  java
04:41:59 PM      4953    0.25    3.70    0.00    3.95     1  java
04:41:59 PM      5883    0.02    3.89    0.00    3.92     1  ilogtail
04:41:59 PM      6644    0.12    8.08    0.00    8.20     2  java
04:41:59 PM     17981    0.15   13.68    0.00   13.83     1  gmond

gmond是集群监控系统ganglia的监控服务,google到 gmond occupying 100% of the CPU 在非多播环境中,gmond进程可能会导致系统CPU百分百占用掉。这种情况下,检查节点是否设置deaf(也就是节点是否接收任何节点的消息),设置

deaf = yes

gmond occupying 100% of the CPU 文档提到 CentOS 6.5有这个现象,ubuntu也有

这个设置貌似有效,虽然还是没有完全解决问题,因为我设置以后再采用上述方法排查

05:26:58 PM      1629    0.01    1.06    0.00    1.06     1  bash
05:26:58 PM     10938    0.01    1.10    0.00    1.11     2  gmond
05:26:58 PM     10114    0.03    1.31    0.00    1.34     0  aegis_quartz
05:26:58 PM     13746    0.01    1.46    0.00    1.47     3  ilogtail
05:26:58 PM     14541    0.60    1.59    0.00    2.19     1  java
05:26:58 PM      1846    0.15    1.68    0.00    1.84     0  java
05:26:58 PM     12909    0.20    2.00    0.00    2.20     1  java
05:26:58 PM      1456    0.19    3.11    0.00    3.30     0  AliYunDun
05:26:58 PM      2599   14.11    3.47    0.00   17.58     3  java
05:26:58 PM     15105    0.08    4.21    0.00    4.29     2  java

看上去至少进程gmond已经不再大量消耗sys

strace

  • gmond进行 strace查看系统调用

strace -c -p 10938

显示输出

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.99   13.704919       39157       350           read
  0.01    0.001210           2       628           fstat
  0.00    0.000211           1       326         1 open
  0.00    0.000192           1       314           munmap
  0.00    0.000147           0       327           close
  0.00    0.000147           0       315           mmap
  0.00    0.000087           0       274       137 write
  0.00    0.000024           0       137           sendto
  0.00    0.000000           0         6           stat
  0.00    0.000000           0       325           lstat
  0.00    0.000000           0         3         3 ioctl
  0.00    0.000000           0         1           select
  0.00    0.000000           0         3           socket
  0.00    0.000000           0         4           getdents
  0.00    0.000000           0         4           statfs
------ ----------- ----------- --------- --------- ----------------
100.00   13.706937                  3017       141 total

检查堆栈

[root@emr-header-1 ~]# cat /proc/10938/stack
[<ffffffff811a06d9>] poll_schedule_timeout+0x39/0x60
[<ffffffff811a179c>] do_select+0x57c/0x6c0
[<ffffffff811a1a6a>] core_sys_select+0x18a/0x2c0
[<ffffffff811a1df7>] sys_select+0x47/0x110
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
for i in `ps -AL |grep gmond |cut -c 7-12`; do \
    echo === $i ===; \
    gdb --q --n --ex bt --batch --pid $i; \
done 2>&1 |tee /tmp/stacks.txt

显示输出

=== 10938 ===
[Thread debugging using libthread_db enabled]
0x0000003bfb4e1353 in __select_nocancel () from /lib64/libc.so.6
#0  0x0000003bfb4e1353 in __select_nocancel () from /lib64/libc.so.6
#1  0x0000003c03023125 in apr_sleep () from /usr/lib64/libapr-1.so.0
#2  0x00000000004088f8 in main ()
  • 看上去AliYunDun 1456 似乎也是一个问题

在没有运行java程序对时候显示如下

[root@emr-header-1 ~]# strace -c -p 1456
Process 1456 attached - interrupt to quit
^CProcess 1456 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.033997         103       331           nanosleep
  0.00    0.000000           0         1           restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.033997                   332           total
[root@emr-header-1 ~]# strace -c -p 1456
Process 1456 attached - interrupt to quit
^CProcess 1456 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.021996          89       248           nanosleep
  0.00    0.000000           0         1           restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.021996                   249           total

开始运行任务以后尝试strace云盾进程,发现这个进程不允许跟踪

[root@emr-header-1 2]# strace -c -p 1456
attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted

开始运行任务以后显示

for i in `ps -AL |grep AliYunDun |cut -c 7-12`; do \
    echo === $i ===; \
    gdb --q --n --ex bt --batch --pid $i; \
done 2>&1 |tee /tmp/stacks.txt

显示输出

=== 1109 ===
[New LWP 1137]
[New LWP 1111]
[New LWP 1110]
[Thread debugging using libthread_db enabled]
0x0000003bfb4aca3d in nanosleep () from /lib64/libc.so.6
#0  0x0000003bfb4aca3d in nanosleep () from /lib64/libc.so.6
#1  0x0000003bfb4ac8b0 in sleep () from /lib64/libc.so.6
#2  0x0000000000441333 in ?? ()
#3  0x000000000041a7dd in ?? ()
#4  0x000000000041ada4 in ?? ()
#5  0x00000000004199ab in ?? ()
#6  0x0000000000409767 in ?? ()
#7  0x0000003bfb41ed5d in __libc_start_main () from /lib64/libc.so.6
#8  0x0000000000409629 in ?? ()
#9  0x00007fff8726fd48 in ?? ()
#10 0x000000000000001c in ?? ()
#11 0x0000000000000001 in ?? ()
#12 0x00007fff87270e99 in ?? ()
#13 0x0000000000000000 in ?? ()
=== 1110 ===

warning: process 1110 is a cloned process
[Thread debugging using libthread_db enabled]
0x0000003bfb4aca3d in nanosleep () from /lib64/libc.so.6
#0  0x0000003bfb4aca3d in nanosleep () from /lib64/libc.so.6
#1  0x0000003bfb4e1be4 in usleep () from /lib64/libc.so.6
#2  0x000000000044131a in ?? ()
#3  0x000000000043e650 in ?? ()
#4  0x000000000043d84f in aqs::CThreadUtil::ThreadFunc(void*) ()
#5  0x0000003bfbc079d1 in start_thread () from /lib64/libpthread.so.0
#6  0x0000003bfb4e88fd in clone () from /lib64/libc.so.6
=== 1111 ===

warning: process 1111 is a cloned process
[Thread debugging using libthread_db enabled]
0x0000003bfb4aca3d in nanosleep () from /lib64/libc.so.6
#0  0x0000003bfb4aca3d in nanosleep () from /lib64/libc.so.6
#1  0x0000003bfb4e1be4 in usleep () from /lib64/libc.so.6
#2  0x000000000044131a in ?? ()
#3  0x0000000000422628 in ?? ()
#4  0x000000000043d84f in aqs::CThreadUtil::ThreadFunc(void*) ()
#5  0x0000003bfbc079d1 in start_thread () from /lib64/libpthread.so.0
#6  0x0000003bfb4e88fd in clone () from /lib64/libc.so.6
=== 1137 ===

warning: process 1137 is a cloned process
[Thread debugging using libthread_db enabled]
0x0000003bfb4aca3d in nanosleep () from /lib64/libc.so.6
#0  0x0000003bfb4aca3d in nanosleep () from /lib64/libc.so.6
#1  0x0000003bfb4ac8b0 in sleep () from /lib64/libc.so.6
#2  0x0000000000441333 in ?? ()
#3  0x0000000000417c43 in ?? ()
#4  0x0000000000417e80 in ?? ()
#5  0x000000000043d84f in aqs::CThreadUtil::ThreadFunc(void*) ()
#6  0x0000003bfbc079d1 in start_thread () from /lib64/libpthread.so.0
#7  0x0000003bfb4e88fd in clone () from /lib64/libc.so.6
=== 1456 ===
  • 对java进程

strace对于java无效

[root@emr-header-1 ~]# strace -c -p 22345
Process 22345 attached - interrupt to quit
^CProcess 22345 detached

java

[root@emr-header-1 ~]# cat /proc/22345/stack
[<ffffffff810ae569>] futex_wait_queue_me+0xb9/0xf0
[<ffffffff810af678>] futex_wait+0x1f8/0x380
[<ffffffff810b0f91>] do_futex+0x121/0xb50
[<ffffffff810b1a3b>] sys_futex+0x7b/0x170
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

jps

[root@emr-header-1 ganglia]# cat /proc/22437/stack
[<ffffffff810ae569>] futex_wait_queue_me+0xb9/0xf0
[<ffffffff810af678>] futex_wait+0x1f8/0x380
[<ffffffff810b0f91>] do_futex+0x121/0xb50
[<ffffffff810b1a3b>] sys_futex+0x7b/0x170
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

发现多个java进程都是等待在futex_wait_queue_me,搜索到[App hangs on futex_wait_queue_me() every a couple of minutes](App hangs on futex_wait_queue_me() every a couple of minutes) 说明:

CentOS 6的内核在运行java和多线程的程序会存在bug,升级到 2.6.32-504.16.2.el6.x86_64 后可以解决,具体的bug报告在 https://bugs.centos.org/view.php?id=8703 ,详细的问题解释在 futex: Ensure get_futex_key_refs() always implies a barrier

b0c29f7

Commit b0c29f7 (futexes: Avoid taking the hb->lock if there's
nothing to wake up) changes the futex code to avoid taking a lock when
there are no waiters. This code has been subsequently fixed in commit
11d4616 (futex: revert back to the explicit waiter counting code).
Both the original commit and the fix-up rely on get_futex_key_refs() to
always imply a barrier.

However, for private futexes, none of the cases in the switch statement
of get_futex_key_refs() would be hit and the function completes without
a memory barrier as required before checking the "waiters" in
futex_wake() -> hb_waiters_pending(). The consequence is a race with a
thread waiting on a futex on another CPU, allowing the waker thread to
read "waiters == 0" while the waiter thread to have read "futex_val ==
locked" (in kernel).

Without this fix, the problem (user space deadlocks) can be seen with
Android bionic's mutex implementation on an arm64 multi-cluster system.

升级CentOS 6内核 2.6.32-642.11.1.el6 升级以后同样的任务调度不再出现异常的sys使用率。

观察 java 进程的stack,依然是相同的函数,但是没有任何阻塞

# cat /proc/30964/stack
[<ffffffff810ba4ea>] futex_wait_queue_me+0xba/0xf0
[<ffffffff810bb660>] futex_wait+0x1c0/0x310
[<ffffffff810bcf61>] do_futex+0x121/0xae0
[<ffffffff810bd99b>] sys_futex+0x7b/0x170
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

# cat /proc/30964/stack
[<ffffffff810ba4ea>] futex_wait_queue_me+0xba/0xf0
[<ffffffff810bb660>] futex_wait+0x1c0/0x310
[<ffffffff810bcf61>] do_futex+0x121/0xae0
[<ffffffff810bd99b>] sys_futex+0x7b/0x170
[<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

参考

Last updated