Hi! Just wondering if anyone else has seen this very strange cluster behaviour after reboot.
We are running R80.40 Take 120, regular gateways, non-VSX.
In short, the sequence is as follows:
Below I have the full message log with comments. It seems that interface settings are modified after the cluster has entered the STANDBY state, and that causes all bond members to go down. Regular interfaces seem to survive without a link down.
The problem isn't that big if you maintain the cluster state as is. But if you use "switch to higher priority member", you may end up in a situation where FW1 goes ACTIVE > DOWN > ACTIVE after reboot, and that has caused heaps of problems in our production network.
This is present on all our clusters and with different bond member types (e.g. 10Gbps drivers ixgbe or i40e, or 1Gbps igb).
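For anyone comparing against their own boxes, a quick way to confirm which driver a bond member is using (eth1-01 here is just an example interface name):

# Print the kernel driver and version bound to a bond member interface
ethtool -i eth1-01 | grep -E '^driver|^version'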
Interesting, thanks for the detailed description! As these things happen within just a few seconds at system start, it might not be easy to monitor the bond status closely via cphaprob show_bond and awk '{print FILENAME ":" $0}' /proc/net/bonding/bond*.
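One way to catch it anyway: a minimal polling loop (a sketch; the log file path and 1-second interval are arbitrary choices) started as early as possible in the boot sequence, e.g. backgrounded from rc.local:

# Poll bond state once per second and append it to a log file,
# so the short-lived flap during boot can be reviewed afterwards.
while true; do
    date >> /var/log/bond_watch.log
    cphaprob show_bond >> /var/log/bond_watch.log 2>&1
    awk '{print FILENAME ":" $0}' /proc/net/bonding/bond* >> /var/log/bond_watch.log 2>/dev/null
    sleep 1
done &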
While I have no solution at the moment, I'd like to propose a workaround:
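The core idea, in a minimal sketch (the pnote name MQHold is made up here; the set_pnote flags are the same ones used in the full script later in this thread): register an administrative pnote that holds the member DOWN early in startup, then unregister it once the risky phase has passed:

# Hold the member DOWN via a custom pnote (name MQHold is arbitrary)
$FWDIR/bin/cphaconf set_pnote -d MQHold -t 1 -s init register
# ... risky part of the startup sequence runs here ...
# Release the hold so the member can rejoin the cluster normally
$FWDIR/bin/cphaconf set_pnote -d MQHold unregister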
Very clever actually, Danny! Since it's not a major drama and we only noticed this on one cluster that was set to "switch to higher priority member", the current workaround will be to change the mode to keep the current active member, so we avoid unnecessary issues with traffic. A case is open with TAC, so hopefully we can get to the bottom of it 🙂 Obviously I'll post any findings / fixes here.
I actually suspect it's at the physical interface level and the PHC (PTP hardware clock) function, as straight after that modification the interface goes down.
But that's a wild guess hehe
Nov 14 08:14:03 2021 fw1 kernel: ixgbe 0000:05:00.0 eth1-01: enabling UDP RSS: fragmented packets may arrive out of order to the stack above
Nov 14 08:14:03 2021 fw1 kernel: ixgbe 0000:05:00.0: removed PHC on eth1-01 <<<<<
Nov 14 08:14:03 2021 fw1 xpand[3503]: Configuration changed from localhost by user admin by the service dbset
Nov 14 08:14:04 2021 fw1 kernel: bond0: link status down for interface eth1-01, disabling it in 200 ms
Nov 14 08:14:04 2021 fw1 last message repeated 16 times
Nov 14 08:14:04 2021 fw1 kernel: ixgbe 0000:05:00.0: registered PHC device on eth1-01
What is the CPU usage while this is happening? Are the SNDs fully utilized?
A while back, on R77.30, I had a similar issue with a cluster where the bond would suddenly go down and cause a cluster fail-over (the other way around) due to a failed check on a bond interface.
In the end it turned out the gateway was under high load at certain moments, caused by Application Control being overwhelmed by an untuned App Control policy. The high load meant Gaia OS could not process the interface queues fast enough, and the bond went down.
If the load is too high on your gateway after the fail-over, this might also be a reason for your issue.
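Two standard ways to check that (exact output format varies by version):

# Show which CPU cores are assigned to interfaces/SNDs vs. FW workers
fw ctl affinity -l -r
# Per-core utilization snapshot
cpstat os -f multi_cpu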
Thanks @Markus_Genser, but the CPUs are not utilised at all. Remember that the cluster member had just booted and entered the STANDBY state 🙂 but it's always good to know such tricks!
Interesting find! I tried the workaround, but it didn't help. Still, it feels like our case is related, as we see UDP RSS being enabled (= MQ) just before the interfaces go down. I have a feeling that there is a connection between MQ being configured during the boot sequence and the bond interfaces.
The Security Gateway uses fwstart to initiate the module loading sequence. A change was made to shorten the OS boot loading time; due to this, the multi-queue configuration does not have enough time to load before the next sequence starts, which causes the interfaces not to load properly.
OK, I think I am onto something! While TAC was chasing me to get HW diagnostics done on the box (despite the fact that we had seen exactly the same symptoms in multiple clusters on different continents) ... I managed to pinpoint the issue to MQ. If MQ is enabled on a bond interface, the bond will go down one extra time during the boot sequence. The tests below were done in the lab, where I had only one cluster member available, so the clustering state will look odd, but it clearly shows the difference with MQ OFF and ON:
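A quick driver-level way to see whether MQ is in effect on a bond member is the channel (queue) count (eth1-01 is an example name; a current combined count greater than 1 typically means multiple queues are active):

# Show hardware queue (channel) counts for a bond member
ethtool -l eth1-01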
Thanks for keeping us updated on your findings! 👍
Still waiting for R&D to come up with the goods 😞 it has taken way too long, so I took matters into my own hands. Basically I used @Danny's idea about a custom PNOTE.
I modified parts of $FWDIR/bin/fwstart, first in the middle:
if ($?highavail && ! $?IS_VSW) then
    if ($fw1_vsx == 0) then
        $FWDIR/bin/cphaconf set_pnote -f $FWDIR/conf/cpha_global_pnotes.conf -g register
    endif
endif
# Test manual cluster DOWN till MQ reconf is applied at the end
echo "*** TEST cluster DOWN ***"
$FWDIR/bin/cphaconf set_pnote -d ForceDown -t 1 -s init register
# load sim settings (affinity)
if ((! $?VS_CTX) && ($linux == 1)) then
    if ($?PPKDIR) then
        $FWDIR/bin/fwaccel on
        $FWDIR/bin/sxl_stats update_ac_name
        if ($fw1_ipv6) then
            $FWDIR/bin/fwaccel6 on
        endif
--
    endif
endif
...and then removed the PNOTE after MQ is reconfigured.
(Actually, the $FWDIR/bin/mq_mng_reconf_all_vs command is the one that causes the problem.)
# Kernel 3.10 - apply MQ settings for all VS
## release lock of mq_mng
rm /tmp/mq_reconf_lock >& /dev/null
if ("$linuxver" != "2.6") then
    $FWDIR/bin/mq_mng_reconf_all_vs >& /dev/null
endif
# Test manual cluster DOWN till bond MQ is applied - finish
echo "*** TEST cluster DOWN finish ***"
sleep 15
$FWDIR/bin/cphaconf set_pnote -d ForceDown unregister
if ((! $?VS_CTX) && ($linux == 1)) then
    # Apply Backplane Ethernet affinity settings
    if ( -e /dev/adp0) then
        /etc/ppk.boot/bin/sam_mq.sh
    endif
endif
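To verify the temporary pnote really holds the member DOWN during boot, the usual cluster status commands are enough on the rebooting member:

# The ForceDown pnote should appear in the list while registered,
# and the member should report DOWN until it is unregistered.
cphaprob -l list
cphaprob state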
So the startup sequence looks like this now:
And clustering messages confirm that there are no unwanted failovers:
After a 4- or 5-month battle with TAC and R&D, I finally have the answer! It turns out I was right from the start: the changes made to the startup sequence, mentioned in sk173928, are causing the problems! You basically need to do the opposite of what the SK suggests and move the MQ reconfigure upwards instead of leaving it at the end of the script.
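In fwstart terms, a conceptual sketch of that change (illustrative only, not the official fix):

# Moved up: run the MQ reconfigure early in fwstart, before the cluster
# member can become ACTIVE, so any bond flap happens while still booting.
if ("$linuxver" != "2.6") then
    $FWDIR/bin/mq_mng_reconf_all_vs >& /dev/null
endif
# ...and the original call near the end of the script is removed.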
KZ 1:0 CP 8)
Hmmmm, the SK says to reach out for a fix. Why did you try to apply a workaround instead?