systemd环境syslog-ng启动失败排查

尝试strace方法跟踪启动时候的库调用

strace /usr/sbin/syslog-ng -F -p /var/run/syslog-ng.pid

结果发现最后出现的问题可能和NSCD service for DNS有关

open("/etc/resolv.conf", O_RDONLY|O_CLOEXEC) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=91, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f30b584e000
read(7, "options timeout:2 attempts:2\nsea"..., 4096) = 91
read(7, "", 4096)                       = 0
close(7)                                = 0
munmap(0x7f30b584e000, 4096)            = 0
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 7
connect(7, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(7)                                = 0
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 7
connect(7, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
close(7)                                = 0
open("/etc/host.conf", O_RDONLY|O_CLOEXEC) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=9, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f30b584e000
read(7, "multi on\n", 4096)             = 9
read(7, "", 4096)                       = 0
close(7)                                = 0
munmap(0x7f30b584e000, 4096)            = 0
futex(0x7f30b3baca70, FUTEX_WAKE_PRIVATE, 2147483647) = 0
open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=87, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f30b584e000
read(7, "127.0.0.1 localhost localhost.lo"..., 4096) = 87
read(7, "", 4096)                       = 0
close(7)                                = 0
munmap(0x7f30b584e000, 4096)            = 0
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=34586, ...}) = 0
mmap(NULL, 34586, PROT_READ, MAP_PRIVATE, 7, 0) = 0x7f30b5804000
close(7)                                = 0
open("/lib64/libnss_dns.so.2", O_RDONLY|O_CLOEXEC) = 7
read(7, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\21\0\0\0\0\0\0"..., 832) = 832
fstat(7, {st_mode=S_IFREG|0755, st_size=27512, ...}) = 0
mmap(NULL, 2117888, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 7, 0) = 0x7f30af50f000
mprotect(0x7f30af514000, 2093056, PROT_NONE) = 0
mmap(0x7f30af713000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 7, 0x4000) = 0x7f30af713000
close(7)                                = 0
mprotect(0x7f30af713000, 4096, PROT_READ) = 0
munmap(0x7f30b5804000, 34586)           = 0
write(2, "Error resolving hostname; host='"..., 34Error resolving hostname; host=''
) = 34
write(2, "Error initializing message pipel"..., 37Error initializing message pipeline;
) = 37
close(5)                                = 0
munmap(0x7f30b580d000, 16384)           = 0
unlink("/var/lib/syslog-ng/syslog-ng.persist-") = 0
exit_group(2)                           = ?
+++ exited with 2 +++

5 simple ways to troubleshoot using Strace 恰好也有一个strace的案例是调试连接服务器问题的,和目前的情况类似。

Troubleshooting and debugging syslog-ng提供了一些debug的建议:

  • 使用debug方式启动syslog-ng -Fevd可以查看详细日志

  • 使用strace进行跟踪 strace -s 256 -f syslog-ng (这个方法即上述使用)

  • 如果出现crash,可以使用 syslog-ng -F –enable-core (也可以使用环境变量 ulimit -c unlimited)。不过,对于有core文件,并且syslog-ng是使用了--enable-debug编译的,则可以使用gdb进行调试(gdb documentation

  • 可以使用logger发送日志给/dev/log,例如logger “message part”

  • 远程发送日志的方法:

    • 本地启用命令 logger -r 10 -i -s 300 -I 2 10.30.0.32 9999,远程则启动nc监听:nc -lp 9999

gdb syslog-ng core

注意:syslog-ng -V应该有Enable-Debug: on才表示编译的时候激活了debug

connect(7, {sa_family=AF_LOCAL, sun_path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)表示尝试连接/var/run/nscd/socket,这个NSCD是Name Service Cache Daemon,通常用于响应NIS,YP, LDAP或类似名字查询的目录协议。

连接NSCD失败后,下一步显示服务转向读取/etc/host.conf配置,可以看到如下信息

open("/etc/host.conf", O_RDONLY|O_CLOEXEC) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=9, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f30b584e000
read(7, "multi on\n", 4096)             = 9
read(7, "", 4096)                       = 0
close(7)                                = 0
munmap(0x7f30b584e000, 4096)            = 0

检查了一下/etc/host.conf配置确实只有一行

multi on

然后读取的是/etc/hosts

open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=87, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f30b584e000
read(7, "127.0.0.1 localhost localhost.lo"..., 4096) = 87
read(7, "", 4096)                       = 0
close(7)                                = 0
munmap(0x7f30b584e000, 4096)            = 0

但是,只看到后来有打开/etc/ld.so.cache/lib64/libnss_dns.so.2,但是没有看到查询DNS(例如连接DNS服务器端口53的记录)

比较奇怪,我找了线上另外一个集群,同样也没有对服务器的IP地址做正向和反向的DNS解析,但是这个集群的syslog-ng服务却能够正常启动

root     32215  0.0  0.0 100504  4512 ?        Ss   23:02   0:00 /usr/sbin/syslog-ng -F -p /var/run/syslogd.pid

man syslog-ng说明-F参数表示非daemon运行,是foreground运行模式。

我同样使用strace /usr/sbin/syslog-ng -F -p /var/run/syslog-ng.pid,惊奇地发现,这个正常服务器上启动的sysylog-ng居然没有去读取/etc/host.conf/etc/hosts,并且也没有尝试对NSCD进行访问。

正常服务器上syslog-ng版本是syslog-ng-3.5.6-3;异常服务器上syslog-ng版本是syslog-ng-3.5.6-5。我也对比了两个主机的syslog-ng.conf配置完全一致。

但是看到syslog-ng.conf配置的前面部分

options {
    flush_lines (0);
    time_reopen (10);
    log_fifo_size (1000);
    chain_hostnames (off);
    use_dns (no);
    use_fqdn (no);
    create_dirs (yes);
    keep_hostname (yes);
    dir_perm(0755);
    owner("root");
    group("adm");
    perm(0640);
    stats_freq(0);

};

其中有一个配置是use_dns (no);,这也就解释了为何使用strace排查,只看到读取了/etc/hosts文件,却看不到DNS解析的请求。

会不会软件版本升级以后,默认启动时候会进行一次初始化的主机名的解析,但是配置文件中设置了use_dns (no);,导致无法完成解析,所以失败。

尝试修改syslog-ng.conf的选项启动DNS,但是发现没有解决问题,报错信息相同

options {
    ...
    use_dns (yes);
    use_fqdn (yes);
    ...
};

回滚了一次syslog-ng-3.5.6-3,报错信息依旧。但是,我发现,如果使用syslog-ng安装包的初始化配置,是可以正常启动syslog-ng服务的。所以,导致这个异常的问题应该是:

  • 自定义的syslog-ng.conf中有配置项比较特别,在某个特定的环境下会激活host查询,但是却无法完成解析导致无法启动

  • 正常的集群中同样地配置,却没有host解析动作,所以绕过了这个bug

  • 但是,是什么环境触发了这个问题?

注意到自定义配置中有如下

options {
    flush_lines (0);
    time_reopen (10);
    log_fifo_size (1000);
    chain_hostnames (off);
    use_dns (no);
    use_fqdn (no);
    create_dirs (yes);
    keep_hostname (yes);
    dir_perm(0755);
    owner("root");
    group("adm");
    perm(0640);
    stats_freq(0);

};

而rpm包安装的默认配置有

options {
    flush_lines (0);
    time_reopen (10);
    log_fifo_size (1000);
    chain_hostnames (off);
    use_dns (no);
    use_fqdn (no);
    create_dirs (no);
    keep_hostname (yes);
};

其中有一行create_dirs (yes);差异,我修改了也没有解决问题。

设置syslog-ng使用本地主机名解析

17.3.1. Procedure – Resolving hostnames locally提供了设置syslog-ng的本地DNS解析方法,使用本地存储的主机名解析

options {
        use-dns(persist_only);
        dns-cache-hosts(/etc/hosts); };

不过这个方法测试设置还是没有解决

syslog-ng低于3.2.5版本需要配置unix-dgram ("/dev/log");

  • syslog-ng < 3.2.5 版本时会默认 /dev/logunix-stream,但是systemd将这个设备文件作为unix-dgram,所以需要修改 /etc/syslog-ng/syslog-ng.conf

...
 source s_sys {
  file ("/proc/kmsg" program_override("kernel: "));
- unix-stream ("/dev/log");
+ unix-dgram ("/dev/log");
  internal();
  # udp(ip(0.0.0.0) port(514));
 };
...

如果没有上述修订配置,启动syslog-ng会出现以下报错信息

May 07 17:26:15 superserver.company.corp systemd[1]: Starting Syslog Socket.
May 07 17:26:15 superserver.company.corp systemd[1]: Listening on Syslog Socket.
May 07 17:26:15 superserver.company.corp systemd[1]: Starting System Logger Daemon...
May 07 17:26:15 superserver.company.corp systemd[1]: syslog-ng.service: main process exited, code=exited, status=2/INVALIDARGUMENT
May 07 17:26:15 superserver.company.corp systemd[1]: Failed to start System Logger Daemon.
May 07 17:26:15 superserver.company.corp systemd[1]: Unit syslog-ng.service entered failed state.
May 07 17:26:15 superserver.company.corp systemd[1]: syslog-ng.service holdoff time over, scheduling restart.
May 07 17:26:15 superserver.company.corp systemd[1]: Stopping System Logger Daemon...
May 07 17:26:15 superserver.company.corp systemd[1]: Starting System Logger Daemon...
May 07 17:26:15 superserver.company.corp systemd[1]: syslog-ng.service: main process exited, code=exited, status=2/INVALIDARGUMENT

开启debug模式会看到如下报错

May 08 10:31:29 server.corp systemd[1]: Starting System Logger Daemon...
May 08 10:31:29 server.corp systemd[1]: About to execute: /usr/sbin/syslog-ng -F -p /var/run/syslogd.pid
May 08 10:31:29 server.corp systemd[1]: Forked /usr/sbin/syslog-ng as 3121
May 08 10:31:29 server.corp systemd[1]: syslog-ng.service changed dead -> start
May 08 10:31:29 server.corp systemd[1]: Set up jobs progress timerfd.
May 08 10:31:29 server.corp systemd[1]: Set up idle_pipe watch.
May 08 10:31:29 server.corp systemd[3121]: Executing: /usr/sbin/syslog-ng -F -p /var/run/syslogd.pid
May 08 10:31:29 server.corp systemd[1]: Got notification message for unit syslog-ng.service
May 08 10:31:29 server.corp systemd[1]: syslog-ng.service: Got message
May 08 10:31:29 server.corp systemd[1]: syslog-ng.service: got STATUS=Starting up... (Fri May  8 10:31:29 2015
May 08 10:31:29 server.corp systemd[1]: Got notification message for unit syslog-ng.service
May 08 10:31:29 server.corp systemd[1]: syslog-ng.service: Got message
May 08 10:31:29 server.corp systemd[1]: syslog-ng.service: got STATUS=Starting up... (Fri May  8 10:31:29 2015
May 08 10:31:29 server.corp systemd[1]: Received SIGCHLD from PID 3121 (syslog-ng).
May 08 10:31:29 server.corp systemd[1]: Child 3121 (syslog-ng) died (code=exited, status=2/INVALIDARGUMENT)
May 08 10:31:29 server.corp systemd[1]: Child 3121 belongs to syslog-ng.service
May 08 10:31:29 server.corp systemd[1]: syslog-ng.service: main process exited, code=exited, status=2/INVALIDARGUMENT
May 08 10:31:29 server.corp systemd[1]: syslog-ng.service changed start -> failed
May 08 10:31:29 server.corp systemd[1]: Job syslog-ng.service/start finished, result=failed
May 08 10:31:29 server.corp systemd[1]: Failed to start System Logger Daemon.

参考

Last updated