Bond interfaces go down after cluster member enters STANDBY / ACTIVE after reboot

Kaspars_Zibarts · ‎2021-11-15

Hi! Just wondering if anyone else has seen this very strange cluster behaviour after reboot.

We are running R80.40 T120, regular gateways, non-VSX

In short, the sequence is as follows:

1. Reboot a cluster member (i.e. FW1, which is STANDBY)
2. FW1 recovers and synchronises connections
3. FW1 cluster state enters STANDBY
4. A few seconds later all bond members report lost link
5. FW1 cluster state goes DOWN
6. The interface driver seems to be reloaded and the interfaces become available again
7. FW1 cluster state enters STANDBY again

Below I have the full message log with comments. It seems that interface settings are modified after the cluster member has entered STANDBY state, and that causes all bond members to go down. Regular (non-bond) interfaces seem to survive without losing link.

The problem isn't that big if the cluster is set to maintain the current active member. But if you use "switch to higher priority member", you may end up in a situation where FW1 goes ACTIVE > DOWN > ACTIVE after a reboot, and that has caused heaps of problems in our production network.

This is present on all our clusters and with different bond member types (i.e. 10Gbps drivers ixgbe or i40e, or the 1Gbps igb driver).

Danny · ‎2021-11-15

Interesting. Thanks for the detailed description. As these things happen within just a few seconds at system start it might not be easy to monitor the bond status closely via cphaprob show_bond and awk '{print FILENAME ":" $0}' /proc/net/bonding/bond*.
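For what it's worth, one way to catch such a short window is to sample the bonding state files once a second with a timestamp. Below is only a rough sketch: bond_summary and the --watch switch are made-up names, and it assumes the usual /proc/net/bonding layout with "Slave Interface:" and "MII Status:" lines.

```shell
#!/bin/bash
# Sketch: summarise bonding state files as "bond slave status" lines.
# bond_summary is a hypothetical helper name, not a Check Point tool.
bond_summary() {
    # Each argument is a file in /proc/net/bonding format.
    awk '
        FNR == 1            { slave = "" }          # new file: forget last slave
        /^Slave Interface:/ { slave = $3; next }    # remember current slave name
        /^MII Status:/ && slave != "" {             # skip the bond-level status
            n = split(FILENAME, parts, "/")
            print parts[n], slave, $3
        }
    ' "$@"
}

# Optional polling mode: prefix each line with the time, stop with Ctrl-C.
# Guarded so that sourcing this file only defines the function.
if [ "${1:-}" = "--watch" ]; then
    while true; do
        bond_summary /proc/net/bonding/bond* | sed "s/^/$(date '+%H:%M:%S') /"
        sleep 1
    done
fi
```

Started early in the boot sequence with --watch and redirected to a file, this should show which second each slave flips to down, which can then be lined up against /var/log/messages.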

While I have no solution at the moment, I'd like to propose a workaround:

1. Create a permanent faildevice with Status:problem, so that a starting gateway does not become active automatically so quickly
2. Set the status of the faildevice to OK via a scheduled job 30 seconds after system start
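The two steps above could be sketched roughly like this. It assumes the cphaconf set_pnote syntax used on R80.40; the pnote name BootHold, the function names and the CPHACONF / HOLD_SECONDS variables are purely illustrative. Whether a permanent pnote really comes back in problem state after a reboot (or has to be reported as problem again before shutdown) is something I would verify on a lab member first.

```shell
#!/bin/bash
# Sketch of the hold-down idea. Names and the delay are illustrative only.
CPHACONF="${CPHACONF:-${FWDIR:-}/bin/cphaconf}"   # overridable for dry runs
PNOTE_NAME="BootHold"                             # hypothetical faildevice name
HOLD_SECONDS="${HOLD_SECONDS:-30}"

hold_member_down() {
    # -t 0      : no report timeout for this device
    # -s problem: a problem state keeps the member from becoming active
    # -p        : keep the registration across reboots
    # $CPHACONF is left unquoted on purpose so a dry-run value can carry arguments.
    $CPHACONF set_pnote -d "$PNOTE_NAME" -t 0 -s problem -p register
}

release_member() {
    # Wait for the boot sequence to settle, then report the device as OK.
    sleep "$HOLD_SECONDS"
    $CPHACONF set_pnote -d "$PNOTE_NAME" -s ok report
}
```

hold_member_down would be run once by hand, and release_member from a job scheduled at system start. Setting CPHACONF="echo cphaconf" prints the commands instead of running them, which is handy for a dry run.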

Kaspars_Zibarts · ‎2021-11-15

Very clever actually, Danny! Since it's not a major drama and we only noticed it on one cluster that was set to "switch to higher priority member", the current workaround will be to change the mode to keep the current active member, so we avoid unnecessary issues with traffic. A case is open with TAC, so hopefully we can get to the bottom of it 🙂 Obviously I'll post any findings / fixes here.

Kaspars_Zibarts · ‎2021-11-15

I actually suspect it's at the physical interface level and related to the PHC (PTP hardware clock) function, as the interface goes down straight after that modification

But that's a wild guess hehe

Nov 14 08:14:03 2021 fw1 kernel: ixgbe 0000:05:00.0 eth1-01: enabling UDP RSS: fragmented packets may arrive out of order to the stack above
Nov 14 08:14:03 2021 fw1 kernel: ixgbe 0000:05:00.0: removed PHC on eth1-01 <<<<<
Nov 14 08:14:03 2021 fw1 xpand[3503]: Configuration changed from localhost by user admin by the service dbset
Nov 14 08:14:04 2021 fw1 kernel: bond0: link status down for interface eth1-01, disabling it in 200 ms
Nov 14 08:14:04 2021 fw1 last message repeated 16 times
Nov 14 08:14:04 2021 fw1 kernel: ixgbe 0000:05:00.0: registered PHC device on eth1-01
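To check whether the same order (PHC removed, link down, PHC registered) shows up on other members, the relevant lines can be pulled out of the message log. This is a minimal sketch that assumes the exact message wording and timestamp layout shown above; phc_bond_events is just a name I made up.

```shell
#!/bin/bash
# Sketch: print "time event interface" for PHC and bond link-down messages.
phc_bond_events() {
    # $1 is a syslog-format file such as /var/log/messages.
    awk '
        # Return the field that follows the first field equal to key.
        function word_after(key,   i) {
            for (i = 1; i < NF; i++) if ($i == key) return $(i + 1)
            return "?"
        }
        # Field 3 is the time of day in the log format shown above.
        /removed PHC on/           { print $3, "PHC-REMOVED", word_after("on"); next }
        /registered PHC device on/ { print $3, "PHC-REGISTERED", word_after("on"); next }
        /link status down for interface/ {
            iface = word_after("interface")
            sub(/,$/, "", iface)                 # drop the trailing comma
            print $3, "LINK-DOWN", iface
        }
    ' "$1"
}
```

Usage would be phc_bond_events /var/log/messages on each member after a reboot. If LINK-DOWN always follows PHC-REMOVED within a second, that at least supports the guess, though it does not prove cause.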

Markus_Genser · ‎2021-11-15

What is the CPU usage while this is happening? Are the SNDs fully utilized?

A while back with R77.30 I had a similar issue with a cluster, where the bond would suddenly go down and cause a cluster fail-over (the other way around) due to a failed check for a bond interface.

In the end it turned out the gateway was under high load at certain moments, because Application Control was being overwhelmed by an untuned App Control policy. The high load meant Gaia OS could not process the interface queue fast enough, and the bond went down.

If the load on your gateway is too high after the fail-over, this might also be a reason for your issue.

Kaspars_Zibarts · ‎2021-11-16

thanks @Markus_Genser but CPUs are not utilised at all. Remember that cluster member just booted and entered STANDBY state 🙂 but it's always good to know such tricks!

Kaspars_Zibarts · ‎2021-11-19

Interesting find! I tried the workaround and it didn't help, but it feels like our case is actually related, as we see UDP RSS being enabled (= MQ) just before the interfaces go down. I have a feeling that there is a connection between MQ being configured during the boot sequence and the bond interfaces going down.

sk173928

Cause

The Security Gateway uses fwstarts to initiate the modules loading sequence. A change was made to shorten the OS boot loading time, due to this, the multi-queue configuration does not have enough time to load before the next sequence starts, which causes the interfaces to not load properly.

Kaspars_Zibarts · ‎2021-11-22

OK, I think I am on to something! Whilst TAC was chasing me to get HW diagnostics done on the box (despite the fact that we had seen exactly the same symptoms in multiple clusters on different continents) ... I managed to pinpoint the issue to MQ. If MQ is enabled on a bond interface, the bond will go down one extra time during the boot sequence. The tests below were done in the lab, where I had only one cluster member available, so the clustering state will look odd, but they clearly show the difference between MQ OFF and ON:

Danny · ‎2021-11-22

Thanks for keeping us updated on your findings! 👍

Kaspars_Zibarts · ‎2022-03-03

Still waiting for R&D to come up with the goods 😞 It has taken way too long, so I took matters into my own hands. Basically I used @Danny's idea about a custom PNOTE.

Modified parts of $FWDIR/bin/fwstart, first in the middle:

if ($?highavail && ! $?IS_VSW) then
        if ($fw1_vsx == 0) then
                $FWDIR/bin/cphaconf set_pnote -f $FWDIR/conf/cpha_global_pnotes.conf -g register
        endif
endif

# Test manual cluster DOWN till MQ reconf is applied at the end
echo "*** TEST cluster DOWN ***"
$FWDIR/bin/cphaconf set_pnote -d ForceDown -t 1 -s init register

# load sim settings (affinity)
if ((! $?VS_CTX) && ($linux == 1)) then
        if ($?PPKDIR) then
                $FWDIR/bin/fwaccel on
                $FWDIR/bin/sxl_stats update_ac_name
                if ($fw1_ipv6) then
                        $FWDIR/bin/fwaccel6 on
                endif
--
        endif
endif

and then remove PNOTE after MQ is reconfigured.

(Actually, the $FWDIR/bin/mq_mng_reconf_all_vs command is the one that causes the problem)

# Kernel 3.10 - apply MQ settings for all VS
##release lock of mq_mng
rm /tmp/mq_reconf_lock >& /dev/null
if ("$linuxver" != "2.6") then
        $FWDIR/bin/mq_mng_reconf_all_vs >& /dev/null
endif

# Test manual cluster DOWN till bond MQ is applied - finish
echo "*** TEST cluster DOWN finish ***"
sleep 15
$FWDIR/bin/cphaconf set_pnote -d ForceDown unregister


if ((! $?VS_CTX) && ($linux == 1)) then
        # Apply Backplane Ethernet affinity settings
        if ( -e /dev/adp0) then
                /etc/ppk.boot/bin/sam_mq.sh
        endif
endif

So startup sequence looks like this now:

And clustering messages confirm that there are no unwanted failovers:

Kaspars_Zibarts · ‎2022-05-02

After a 4 or 5 month battle with TAC and R&D I finally have the answer! Turns out I was right from the start: the changes made to the startup sequence mentioned in sk173928 are causing the problems! You basically need to do the opposite of what the SK suggests and move the MQ reconfiguration earlier in the script instead of to the end.

KZ 1:0 CP 8)

_Val_ · ‎2022-05-02

Hmmmm, the SK says, reach out for a fix. Why did you try to apply a workaround instead?
