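Occasionally, host A suddenly becomes unable to reach host B (neither ping nor ssh to B works). At the same time, B can still ping external networks normally through the gateway, and A can also reach external networks normally.
Because A and B are two hosts on the same network segment, traffic between them is delivered directly at layer 2 and relies on ARP resolution rather than on the gateway, so the problem was suspected to be ARP-related.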
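The same-segment premise can be checked with a little arithmetic. The following is a minimal sketch, not part of the original notes: the function names `ip_to_int` and `same_subnet` are hypothetical, and the /24 prefix in the usage line is an assumption, since the notes do not state the netmask.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address on dots into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed (exit status 0) if both addresses fall in the same network
# under the given prefix length.
same_subnet() {
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    [ $(( $a & $mask )) -eq $(( $b & $mask )) ]
}

# Usage (the /24 prefix is assumed):
#   same_subnet 30.17.44.236 30.17.44.254 24 && echo "same segment"
```

If the two hosts turn out not to share a network, traffic between them would pass through the gateway and ARP between A and B would not be involved.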
Troubleshooting
When the failure occurs, check the ARP table on host B:
[root@devstack ~]# arp -an
? (30.17.44.254) at 58:69:6c:31:de:16 [ether] on wlp3s0
? (172.17.0.2) at 02:42:ac:11:00:02 [ether] on docker0
? (172.17.0.5) at 02:42:ac:11:00:05 [ether] on docker0
? (30.17.44.236) at <incomplete> [ether] on wlp3s0
? (172.17.0.6) at 02:42:ac:11:00:06 [ether] on docker0
? (172.17.0.9) at <incomplete> on docker0
? (172.17.0.8) at 02:42:ac:11:00:08 [ether] on docker0
? (192.168.122.11) at 52:54:00:82:87:0e [ether] on virbr0
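To pick the unresolved entries out of a listing like the one above, a small filter is enough. This is only a sketch: the function name `incomplete_entries` is hypothetical, and it reads `arp -an`-style text on stdin so that it can also be tried on saved output.

```shell
# Print "IP interface" for every entry whose hardware address is <incomplete>.
incomplete_entries() {
    awk '/<incomplete>/ {
        gsub(/[()]/, "", $2)   # strip the parentheses around the IP
        print $2, $NF          # the last field is the interface name
    }'
}

# Usage on a live system:
#   arp -an | incomplete_entries
```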
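When an entry is in the STALE state (for example, 192.168.42.1 with MAC 00:25:90:7d:7e:cd), pinging that address still sends the packets correctly to 00:25:90:7d:7e:cd. A second or so later, the kernel usually sends an ARP request "who has 192.168.42.1" so that it can refresh its cache and return the entry to the REACHABLE state. However, the kernel sometimes adjusts the timeout based on feedback from higher-layer protocols. This means that if you ping 192.168.42.1 and it replies, the kernel may not send an ARP request at all, because it assumes the pong already confirms that the cached ARP entry is correct. An entry in the STALE state is also updated by any unsolicited ARP replies the host happens to see.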
[root@devstack ~]# ip neigh
30.17.44.254 dev wlp3s0 lladdr 58:69:6c:31:de:16 STALE
172.17.0.2 dev docker0 lladdr 02:42:ac:11:00:02 STALE
172.17.0.5 dev docker0 lladdr 02:42:ac:11:00:05 REACHABLE
30.17.44.236 dev wlp3s0 lladdr b8:e8:56:33:e4:8a REACHABLE
172.17.0.6 dev docker0 lladdr 02:42:ac:11:00:06 STALE
172.17.0.9 dev docker0 INCOMPLETE
172.17.0.8 dev docker0 lladdr 02:42:ac:11:00:08 STALE
192.168.122.11 dev virbr0 lladdr 52:54:00:82:87:0e STALE
[root@devstack ~]# ip neigh
30.17.44.254 dev wlp3s0 lladdr 58:69:6c:31:de:16 DELAY
...
[root@devstack ~]# ip neigh
30.17.44.254 dev wlp3s0 lladdr 58:69:6c:31:de:16 REACHABLE
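The STALE, DELAY, REACHABLE progression in the listings above can be followed for a single neighbor by extracting its state field. A minimal sketch, assuming `ip neigh`-style text on stdin; `neigh_state` is a hypothetical helper name.

```shell
# Print the neighbor state (the last field) for one IP address.
neigh_state() {
    awk -v ip="$1" '$1 == ip { print $NF }'
}

# Usage on a live system (polls once per second until interrupted):
#   while sleep 1; do ip neigh show dev wlp3s0 | neigh_state 30.17.44.254; done
```

Running the polling loop while pinging the gateway from another terminal should show the state changing as the kernel re-validates the entry.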
$ ip -s neighbor list
192.168.42.1 dev eth0 lladdr 00:25:90:7d:7e:cd ref 2 used 184/184/139 probes 4 STALE
192.168.10.2 dev eth0 lladdr 00:1c:23:cf:0b:6a ref 3 used 33/28/0 probes 1 REACHABLE
192.168.10.1 dev eth0 lladdr 00:17:c5:d8:90:a4 ref 219 used 275/4/121 probes 1 REACHABLE
ip link set arp off dev wlp3s0
ip link set arp on dev wlp3s0
ifdown wlp3s0
ifup wlp3s0
30.17.44.236 b8:e8:56:33:e4:8a
? (30.17.44.236) at b8:e8:56:33:e4:8a [ether] on wlp3s0