Solved: R80.40 Cluster-Interfaces down

MladenAntesevic · ‎2020-08-15

Hi,

We've configured a 5600 cluster (HA) and we see that 4 bond VLAN subinterfaces are down on both the Active and Standby firewalls. Besides these four VLAN subinterfaces, we have the external eth1 interface UP, the directly connected bond10 as sync also UP (these are direct cables between the two members), and bond1 as inside also UP.

[Expert@CP1:0]# cphaprob -a if

CCP mode: Manual (Unicast)
Required interfaces: 3
Required secured interfaces: 1

Interface Name: Status:

eth1                 UP
Mgmt                 Non-Monitored
bond1 (LS)           UP
bond10 (S-LS)        UP
bond4.5 (LS)         DOWN (58713 secs)
bond4.42 (LS)        DOWN (58713 secs)

S - sync, LM - link monitor, HA/LS - bond type

Virtual cluster interfaces: 6

eth1    <public_ip1>
bond1 x.y.4.254
bond4.6 x.y.6.254
bond4.5 x.y.5.254
bond4.42    x.y.42.254
bond4.41    x.y.41.254

We have the same output for the second cluster member. We have the same software release on both cluster members:

[Expert@CP1:0]# cphaprob release

Release: R80.40 T294

Kernel build:           994000089
FW1 build:              994000101
FW1 private fixes:      HOTFIX_TEX_ENGINE_R8040_AUTOUPDATE
                        HOTFIX_R80_40_JUMBO_HF_MAIN

ID SW release

1 (local) R80.40 T294
2 R80.40 T294

The bond1 and bond4 interfaces are interconnected over two Cisco Nexus 9300 switches. We double-checked the cables and the VLAN configuration, and everything looks fine. One more strange thing we noticed is that the bond interfaces are sending ARP requests targeting the whole X.Y.5.0/24 subnet, for example:

[Expert@CP1:0]# tcpdump -i bond4.5
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bond4.5, link-type EN10MB (Ethernet), capture size 262144 bytes
16:23:32.485524 ARP, Request who-has X.Y.5.66 tell X.Y.5.251, length 28
16:23:32.485529 ARP, Request who-has X.Y.5.67 tell X.Y.5.251, length 28
16:23:32.485530 ARP, Request who-has X.Y.5.68 tell X.Y.5.251, length 28
16:23:32.485531 ARP, Request who-has X.Y.5.69 tell X.Y.5.251, length 28
16:23:32.485532 ARP, Request who-has X.Y.5.70 tell X.Y.5.251, length 28
16:23:32.485551 ARP, Request who-has X.Y.5.252 tell X.Y.5.251, length 28
16:23:32.585510 ARP, Request who-has X.Y.5.71 tell X.Y.5.251, length 28
16:23:32.585513 ARP, Request who-has X.Y.5.72 tell X.Y.5.251, length 28
...

What could be the reason this is happening? We are pretty sure that the interconnecting switches are properly configured.

Timothy_Hall · ‎2020-08-15

Do you have at least one other responding IP address (such as a switch or router) on interfaces bond4.5 & bond4.42? If not, the interfaces will be declared down by ClusterXL even though connectivity between the cluster members on those interfaces is working. The traffic you are seeing in your tcpdump is the cluster desperately trying to determine whether anything else is alive on those interfaces. It sounds like it is not finding anything, hence the down state.
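To know which VLANs need a responding host, it can help to pull the DOWN interfaces out of saved "cphaprob -a if" output. The following is a minimal sketch, not a Check Point tool: the function name down_interfaces is made up for illustration, the line format is assumed to match the output posted above, and the script is meant to run on any workstation against pasted text.

```python
import re

# Matches status lines such as "bond4.5 (LS)         DOWN (58713 secs)".
# The layout is assumed from the output posted in this thread.
STATUS_LINE = re.compile(
    r"^(?P<name>\S+)(?:\s+\((?P<flags>[^)]*)\))?\s+"
    r"(?P<status>UP|DOWN|Non-Monitored)\b"
    r"(?:\s+\((?P<secs>\d+) secs\))?"
)


def down_interfaces(cphaprob_output):
    """Return (interface, seconds_down) for every line reporting DOWN."""
    result = []
    for line in cphaprob_output.splitlines():
        m = STATUS_LINE.match(line.strip())
        if m and m.group("status") == "DOWN":
            secs = m.group("secs")
            result.append((m.group("name"), int(secs) if secs else None))
    return result


# Status section copied from the first post.
SAMPLE = """\
Interface Name: Status:

eth1                 UP
Mgmt                 Non-Monitored
bond1 (LS)           UP
bond10 (S-LS)        UP
bond4.5 (LS)         DOWN (58713 secs)
bond4.42 (LS)        DOWN (58713 secs)
"""

if __name__ == "__main__":
    for name, secs in down_interfaces(SAMPLE):
        print(name, "down for", secs, "seconds")
```

Each interface it lists is a VLAN where ClusterXL has not heard from any other host, so those are the subnets to check for a reachable switch, router, or server.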

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

MladenAntesevic · ‎2020-08-15

I have a Centos8 server in the bond4.5 subnet (its address is .248), and it is replying to ARP requests destined to its address. I have just checked on the Centos8 server, and I can see that it replies to both cluster members:

[root@Centos8 ~]# tcpdump -nn -i enp0s20f0u1.5 | grep 248
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on enp0s20f0u1.5, link-type EN10MB (Ethernet), capture size 262144 bytes
17:20:54.626647 ARP, Request who-has 10.100.5.248 tell 10.100.5.252, length 42
17:20:54.626654 ARP, Reply 10.100.5.248 is-at 00:e0:4c:36:01:e4, length 28
17:20:56.507907 ARP, Request who-has 10.100.5.248 tell 10.100.5.251, length 42
17:20:56.507923 ARP, Reply 10.100.5.248 is-at 00:e0:4c:36:01:e4, length 28

I am trying to see whether the reply comes back to the cluster member, but there is no reply on my cluster (although I am pretty sure the interconnecting Cisco switch is properly configured). Only requests are going out, and no reply is seen on the firewall:

tcpdump -nn -i bond4.5 | grep 248
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bond4.5, link-type EN10MB (Ethernet), capture size 262144 bytes
17:20:10.704519 ARP, Request who-has 10.100.5.248 tell 10.100.5.251, length 28
17:20:15.804511 ARP, Request who-has 10.100.5.248 tell 10.100.5.251, length 28
17:20:20.904513 ARP, Request who-has 10.100.5.248 tell 10.100.5.251, length 28
17:20:26.004509 ARP, Request who-has 10.100.5.248 tell 10.100.5.251, length 28
17:20:31.105515 ARP, Request who-has 10.100.5.248 tell 10.100.5.251, length 28
17:20:36.106517 ARP, Request who-has 10.100.5.248 tell 10.100.5.251, length 28

One more thing: I have no security policy defined for the bond4.5 subnet, but I believe the ARP reply should still be visible in the tcpdump capture.
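The comparison between the two captures above (requests and replies on the server, requests only on the firewall) can be made mechanical. This is a rough sketch that tallies ARP requests and replies per target IP from tcpdump text output. The function names are hypothetical, and the regular expressions assume the line format shown in the captures in this thread.

```python
import re
from collections import Counter

# Line formats assumed from the tcpdump output quoted in this thread.
REQUEST = re.compile(r"ARP, Request who-has (\S+) tell")
REPLY = re.compile(r"ARP, Reply (\S+) is-at")


def arp_summary(capture_text):
    """Count ARP requests and replies per target IP."""
    requests, replies = Counter(), Counter()
    for line in capture_text.splitlines():
        m = REQUEST.search(line)
        if m:
            requests[m.group(1)] += 1
            continue
        m = REPLY.search(line)
        if m:
            replies[m.group(1)] += 1
    return requests, replies


def unanswered(capture_text):
    """Return target IPs that were asked for but never answered in this capture."""
    requests, replies = arp_summary(capture_text)
    return sorted(ip for ip in requests if replies[ip] == 0)


# Excerpts copied from the two captures above.
SERVER_CAPTURE = """\
17:20:54.626647 ARP, Request who-has 10.100.5.248 tell 10.100.5.252, length 42
17:20:54.626654 ARP, Reply 10.100.5.248 is-at 00:e0:4c:36:01:e4, length 28
17:20:56.507907 ARP, Request who-has 10.100.5.248 tell 10.100.5.251, length 42
17:20:56.507923 ARP, Reply 10.100.5.248 is-at 00:e0:4c:36:01:e4, length 28
"""

FIREWALL_CAPTURE = """\
17:20:10.704519 ARP, Request who-has 10.100.5.248 tell 10.100.5.251, length 28
17:20:15.804511 ARP, Request who-has 10.100.5.248 tell 10.100.5.251, length 28
17:20:20.904513 ARP, Request who-has 10.100.5.248 tell 10.100.5.251, length 28
"""

if __name__ == "__main__":
    print("server side unanswered:", unanswered(SERVER_CAPTURE))
    print("firewall side unanswered:", unanswered(FIREWALL_CAPTURE))
```

An IP that is answered on the server side but unanswered on the firewall side means the reply is being lost somewhere between the two capture points.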

Vladimir · ‎2020-08-15

I know that this suggestion is a pain to implement, but it should conclusively show where the problem is: mirror the VLAN in question on the Cisco switch connected to the cluster member that is not seeing replies, and perform a packet capture on the SPAN port.

I have seen situations where running tcpdump on cluster members resulted in incomplete or misleading conclusions, whereas the problems were in redundant L2/3 Cisco segments.

The replies may be forwarded to the incorrect cluster member. You may also run tcpdump on the other member to see if my theory is correct.

MladenAntesevic · ‎2020-08-16

Hi Vladimir,

I was unable to do a packet capture on the switch since my Cisco Nexus does not support SPAN on egress (TX), but I found out some very interesting facts anyway.

My active cluster member is definitely receiving ARP replies. I can see the replies coming in if I start tcpdump on the physical port eth5:

[Expert@CP1:0]# tcpdump -e -i eth5 | grep 248
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth5, link-type EN10MB (Ethernet), capture size 262144 bytes
15:01:15.789127 00:e0:4c:36:01:e4 (oui Unknown) > 00:1c:7f:8d:e3:4b (oui Unknown), ethertype 802.1Q (0x8100), length 64: vlan 5, p 0, ethertype ARP, Reply 10.100.5.248 is-at 00:e0:4c:36:01:e4 (oui Unknown), length 46

So replies are received and everything is regularly tagged with VLAN ID 5.

Furthermore, I have checked my bond settings and they are fine: physical ports eth4 and eth5 are in bond4, as you can see from the attached screenshot.

One more interesting fact: the cluster member is sending traffic through eth4 and receiving traffic through eth5, as I can see from the traffic statistics:

[Expert@CP1:0]# ifconfig eth4 | grep RX
RX packets:68 errors:0 dropped:0 overruns:0 frame:0
RX bytes:8315 (8.1 KiB) TX bytes:615298270 (586.7 MiB)
[Expert@CP1:0]# ifconfig eth5 | grep RX
RX packets:17502724 errors:0 dropped:0 overruns:0 frame:0
RX bytes:980433759 (935.0 MiB) TX bytes:586396 (572.6 KiB)

Furthermore, if I start tcpdump on the main bond4 interface or on the bond4.5 subinterface, no replies are seen, so somehow the ARP replies get lost between the bond member eth5 and the bond interface it belongs to. Maybe the fact that traffic is sent over eth4 and received over eth5 is somehow confusing the cluster bond4 interface, and the traffic gets lost.
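The one-way pattern in the counters above can be quantified. As an illustrative sketch only, the following computes each slave's share of the bond's received and transmitted bytes from ifconfig output. The helper names are invented for this example, and the parsing assumes the "RX bytes:... TX bytes:..." line format shown above. A lopsided split is a hint rather than proof, because a layer2 hash with few peers can also distribute traffic unevenly.

```python
import re

# Assumes the classic ifconfig line "RX bytes:8315 (8.1 KiB) TX bytes:615298270 (...)".
BYTES = re.compile(r"RX bytes:(\d+).*?TX bytes:(\d+)")


def rx_tx_bytes(ifconfig_text):
    """Extract (rx_bytes, tx_bytes) from ifconfig output for one interface."""
    m = BYTES.search(ifconfig_text)
    if m is None:
        raise ValueError("no 'RX bytes:/TX bytes:' line found")
    return int(m.group(1)), int(m.group(2))


def direction_shares(slaves):
    """Map slave name -> (share of total RX bytes, share of total TX bytes)."""
    counters = {name: rx_tx_bytes(text) for name, text in slaves.items()}
    total_rx = sum(rx for rx, _ in counters.values()) or 1  # avoid dividing by zero
    total_tx = sum(tx for _, tx in counters.values()) or 1
    return {
        name: (rx / total_rx, tx / total_tx)
        for name, (rx, tx) in counters.items()
    }


# Counter lines copied from the output above.
SLAVES = {
    "eth4": "RX bytes:8315 (8.1 KiB) TX bytes:615298270 (586.7 MiB)",
    "eth5": "RX bytes:980433759 (935.0 MiB) TX bytes:586396 (572.6 KiB)",
}

if __name__ == "__main__":
    for name, (rx_share, tx_share) in direction_shares(SLAVES).items():
        print(f"{name}: {rx_share:.1%} of RX, {tx_share:.1%} of TX")
```

With the numbers from this post, eth4 carries almost all of the transmitted bytes and eth5 almost all of the received bytes, which matches the observation that the two slaves are not behaving as one aggregated link.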

Timothy_Hall · ‎2020-08-16

Please provide output of cphaprob show_bond -a to ensure ClusterXL is OK with your bond configuration.

Whenever you use tcpdump, pass the -p flag to disable promiscuous mode during your capture. In promiscuous mode, tcpdump still shows frames that the receiving system will not actually process because of a MAC address mismatch, which is a classic example of the observer effect sabotaging your troubleshooting efforts.

Finally, try pulling out one physical interface from your bond (eth4 or eth5) on both the firewall and switch side so that it is a "bond of one" and see what happens. If the problem goes away it is probably something in your bond setup on the switch.


MladenAntesevic · ‎2020-08-16

Hi Timothy,

The bond status is down as I described above. Here is the output from the cphaprob show_bond command:

[Expert@CP1:0]# cphaprob show_bond bond4.5

Bond name: bond4.5
Bond mode: Load Sharing
Bond status: DOWN

Balancing mode: 802.3ad Layer2 Load Balancing
Configured slave interfaces: 2
In use slave interfaces: 2
Required slave interfaces: 1

Also, running tcpdump with the -p flag still shows ARP replies coming into eth5:

[Expert@CP1:0]# tcpdump -p -e -i eth5 | grep 248
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth5, link-type EN10MB (Ethernet), capture size 262144 bytes
16:33:50.326391 00:e0:4c:36:01:e4 (oui Unknown) > 00:1c:7f:8d:e3:4b (oui Unknown), ethertype 802.1Q (0x8100), length 64: vlan 5, p 0, ethertype ARP, Reply 10.100.5.248 is-at 00:e0:4c:36:01:e4 (oui Unknown), length 46

I will try pulling the cables as you suggested to check whether the bond works with just one member.

MladenAntesevic · ‎2020-08-16

Hi Timothy,

You were right: as you suggested, the port-channel configuration on the Cisco switch was wrong. It was actually a very basic mistake. The port-channel on the Cisco side was statically defined, without any LACP, while we have 802.3ad LACP on the cluster side. I did not immediately recognize such a basic mistake because traffic was actually flowing in one direction but not in the other, which was quite confusing.

Thank you for your help in solving this issue.
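A mismatch like this can often be spotted from the firewall side without touching the switch. The Linux bonding driver reports the LACP partner in /proc/net/bonding/<bond>, and a partner MAC of all zeros usually means no LACP PDUs are arriving from the other end. The sketch below assumes that Gaia exposes this file the way a standard Linux system does. The two sample excerpts are hypothetical and were not captured from the cluster in this thread, and the function names are made up for illustration.

```python
import re

# "Partner Mac Address:" is the field name used by the Linux bonding driver
# in /proc/net/bonding/<bond> for 802.3ad mode.
PARTNER_MAC = re.compile(r"Partner Mac Address:\s*([0-9A-Fa-f:]{17})")
NO_PARTNER = "00:00:00:00:00:00"


def lacp_partner_macs(bonding_text):
    """Return every partner MAC address reported in the bonding status text."""
    return [mac.lower() for mac in PARTNER_MAC.findall(bonding_text)]


def has_lacp_partner(bonding_text):
    """True if at least one non-zero LACP partner MAC is reported."""
    return any(mac != NO_PARTNER for mac in lacp_partner_macs(bonding_text))


# Hypothetical excerpt: switch side configured as a static port-channel.
SAMPLE_STATIC_PEER = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 1
        Partner Mac Address: 00:00:00:00:00:00
"""

# Hypothetical excerpt: switch side negotiating LACP (made-up MAC address).
SAMPLE_LACP_PEER = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 2
        Partner Mac Address: 00:de:ad:be:ef:01
"""

if __name__ == "__main__":
    print("static peer has partner:", has_lacp_partner(SAMPLE_STATIC_PEER))
    print("LACP peer has partner:", has_lacp_partner(SAMPLE_LACP_PEER))
```

If the check reports no partner while the firewall bond is set to 802.3ad, the switch-side port-channel mode is the first thing to verify.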
