topic Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10 in Firewall and Security Management

Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

Edward_Waithaka — Tue, 12 Jul 2022 15:49:29 GMT

We recently upgraded a VSX VSLS cluster from R80.20 to R81.10 HF55 and the VSs are now reported as DOWN due to FWD pnote (FWD on Active VSX cluster member VSs are in Terminated state T).

Only one VS is in Active/Standby state while the rest are Active/Down.

Anyone faced such an issue recently?

See sample output.

Cluster name: CP-Cluster

Virtual Devices Status on each Cluster Member
=============================================

ID | Weight| CP-G| CP-G
| | W-1 | W-2
| | [local] |
-------+-------+-----------+-----------
2 | 10 | DOWN | ACTIVE
6 | 10 | ACTIVE(!) | DOWN
7 | 10 | DOWN | ACTIVE(!)
8 | 10 | DOWN | ACTIVE(!)
9 | 10 | DOWN | ACTIVE(!)
10 | 10 | STANDBY | ACTIVE
---------------+-----------+-----------
Active | 1 | 5
Weight | 10 | 50
Weight (%) | 16 | 84

Legend: Init - Initializing, Active! - Active Attention
Down! - ClusterXL Inactive or Virtual System is Down

Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

Chris_Atkinson — Wed, 13 Jul 2022 02:49:21 GMT

Hopefully you are engaged with TAC on this issue, what was already attempted in respect to troubleshooting & recovery?

What process was followed to complete the upgrade, are both members upgraded at this point?

Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

KostasGR — Wed, 13 Jul 2022 12:39:28 GMT

Hello @Edward_Waithaka

Did you try to reboot the VSX gateways?

BR,
Kostas

Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

Edward_Waithaka — Sun, 17 Jul 2022 14:31:14 GMT

TAC was involved but not much done apart from collection of cpinfo.
The MVC upgrade option was used.
Both gateways had been upgraded but we ended up rolling back one gateway back to R80.20 using a snapshot.
We attempted to rollback the 2nd gateway using a snapshot and it ended up corrupting the disks leading to a boot loop. We did a fresh install of R81.10 and reconfigured vsx.
The reinstalled R81.10 gateway is having intermittent performance issues in that traffic seems to be "dropped/lost" (TCP retransmissions seen on client-end Wireshark, no drops seen on zdebug + drop). When we failover to R80.20 gateway, all is well. We suspect default settings (see below) of CoreXL and MQ might be leading to the issues.
Action plan -
- Increase number of SNDs to 12 - fw ctl affinity -s -d -fwkall 36
- Disable CoreXL on VS0 using cpconfig, assign VS0 instances manually - fw ctl affinity -s -d -vsid 0 -cpu 6 30
- Change MQ on Mgmt & Sync (bond0)?? - mq_mng -s manual -i Mgmt -c 5 29 AND mq_mng -s manual -i eth2-01 eth2-01 -c 4 28
- If performance improves on GW2, reimage GW1 and reconfigure it with similar settings as above.

@Chris_Atkinson what is the effect of enabling dynamic_balancing after making the above changes?

Is it better to leave defaults and enable dynamic_balancing ?

--- Current CoreXL affinity & MQ settings ----------------------------------

[Expert@CP-GW2:0]# fw ctl affinity -l -a
VS_0 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
VS_1 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
VS_2 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
VS_3 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
VS_4 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
VS_6 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
VS_7 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
VS_8 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
VS_9 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
VS_10 fwk: CPU 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
Interface Mgmt: has multi queue enabled
Interface eth2-01: has multi queue enabled
Interface eth2-02: has multi queue enabled
Interface eth4-03: has multi queue enabled
Interface eth4-04: has multi queue enabled
Interface eth5-01: has multi queue enabled
Interface eth5-02: has multi queue enabled
Interface eth1-01: has multi queue enabled
Interface eth1-02: has multi queue enabled
[Expert@CP-GW2:0]#

[Expert@CP-GW2:0]# cpmq get -vv
Note: 'cpmq' is deprecated and no longer supported. For multiqueue management, please use 'mq_mng'
Current multiqueue status:
Total 48 cores. Available for MQ 4 cores
i/f driver driver mode state mode (queues) cores
actual/avail
------------------------------------------------------------------------------------------------
Mgmt igb Kernel Up Auto (2/2) 0,24
eth1-01 ixgbe Kernel Up Auto (4/4) 0,24,1,25
eth1-02 ixgbe Kernel Up Auto (4/4) 0,24,1,25
eth2-01 igb Kernel Up Auto (4/4) 0,24,1,25
eth2-02 igb Kernel Up Auto (4/4) 0,24,1,25
eth4-03 ixgbe Kernel Up Auto (4/4) 0,24,1,25
eth4-04 ixgbe Kernel Up Auto (4/4) 0,24,1,25
eth5-01 mlx5_core Kernel Up Auto (4/4) 0,24,1,25
eth5-02 mlx5_core Kernel Up Auto (4/4) 0,24,1,25

Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

Edward_Waithaka — Sun, 17 Jul 2022 14:33:17 GMT

@KostasGR yes we tried to reboot.

Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

Chris_Atkinson — Mon, 18 Jul 2022 00:56:40 GMT

A prerequisite to start Dynamic Balancing, is having all FWKs set to the default FWKs CPUs.

@AmitShmuel talks about it here:

https://community.checkpoint.com/t5/Security-Gateways/dynamic-balancing-VSX-and-core-affinity/td-p/146273

Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

Edward_Waithaka — Mon, 18 Jul 2022 04:16:17 GMT

I will assume default settings means out of the box configuration with coreXL enabled on VS0 with 40 instances. From there dynamic balancing is assumed to do its magic.

Also, On further analysis using FW monitor, we also suspect that our Mellanox 2x40G NIC cards might be having issues while in bonded state. We will also try a firmware upgrade of the cards after testing the coreXL/MQ settings.

Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

Edward_Waithaka — Wed, 20 Jul 2022 17:14:23 GMT

Update

Seems the 40G NICs required a firmware update after moving to R81.10. Traffic was being dropped on the 40G bond interface causing TCP SYN retransmissions leading to slow loading of web applications. After upgrading the firmware all was good but we have ran into a new issue as seen below from TAC.

" It seems that you have experienced a segmentation fault that is recently common in the later takes of R81.10 - this segmentation fault is usually causing an FWK crash and initiates a failover.

We do not have an official SK about it as it is in an internal R&D investigation.

Please install the hotfix I have provided you with, as segmentation faults could harm the machine and it is very important to act quickly with this matter."

One of our VS crashed and failed over to the standby VS (still running on R80.20).

Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

Markus_Genser — Thu, 21 Jul 2022 13:20:29 GMT

Are you noticing high CPU load for specific VS?

I ran into a similar issue after a VSX upgrade from R80.20 -> R81.10 but in HA mode.

Our standby member was flapping between Standby and DOWN due to missing Sync interface and the virtual router VS was missing its interfaces.

In addition the virtual router VS was consuming CPU up to 200% and a lot of Interface DRPs and OVR overall were visible.

We started tuning SNDs and FWKs but no change was resolving the issue.

Thanks to the last support engineer, who found a similar case, we were able to pinpoint it to priority queue and after disabling it the cluster became stable.

It would be worth in your case to check, if the spike detective is printing errors regarding the fwkX_hp process.

This one is responsible for the high priority queue in priority queue and clogged the specific VS in our case.

BR,

Markus

Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

Edward_Waithaka — Thu, 21 Jul 2022 15:20:45 GMT

@Markus_Genser could you share the commands used to identify the issue and how to kill the priority queues.

Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

Markus_Genser — Fri, 22 Jul 2022 06:28:41 GMT

Sure,

spike detective is reporting to /var/log/messages,

we got the following messages over and over again, especially during peak times with a lot of traffic passing through, note the fwk1_hp (according to TAC this is the high priority queue)

Jul 20 15:28:51 2022 <GWNAME> spike_detective: spike info: type: thread, thread id: 3383, thread name: fwk1_hp, start time: 20/07/22 15:28:26, spike duration (sec): 24, initial cpu usage: 100, average cpu usage: 100, perf taken: 0 Jul 20 15:28:57 2022 <GWNAME> spike_detective: spike info: type: cpu, cpu core: 5, top consumer: fwk1_hp, start time: 20/07/22 15:28:26, spike duration (sec): 30, initial cpu usage: 84, average cpu usage: 79, perf taken: 1 Jul 20 15:29:03 2022 <GWNAME> spike_detective: spike info: type: cpu, cpu core: 21, top consumer: fwk1_hp, start time: 20/07/22 15:28:56, spike duration (sec): 6, initial cpu usage: 85, average cpu usage: 85, perf taken: 0 Jul 20 15:29:03 2022 <GWNAME> spike_detective: spike info: type: thread, thread id: 3383, thread name: fwk1_hp, start time: 20/07/22 15:28:56, spike duration (sec): 6, initial cpu usage: 100, average cpu usage: 100, perf taken: 0

For the virtual router, which is normally using 5-10% cpu, this is not normal

The rest was good old detective work with top to identify the VS causing the issue (virtual router in our case).

In the VS $FWDIR/log/fwk.elg to verify that the cluster status is caused by missing CCP packets, messages like this:

State change: ACTIVE -> ACTIVE(!) | Reason: Interface Sync is down (Cluster Control Protocol packets are not received)

even though interfaces are up and you can ping the neighbour and we had enough SNDs to handle all the traffic and all in all the rest of the VS were consuming cpu in a low level, which showed most of the cores idling.

The TAC engineer used one additional command in the remote session, which i failed to note, that showed the various CPU hits for the kernel modules in percent which also displayed the fwk1_hp on top and following all that, he suggested to turn of priority queue.

To deactivate priority queue :

fw ctl multik prioq
Select 0 to disable the feature.
Reboot the device.

BR,

Markus

Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

Edward_Waithaka — Fri, 22 Jul 2022 06:51:40 GMT

@Markus_Genser thanks.

My take is that the merger of the R80.X SP (Maestro etc) with the normal R80.X has brought about many issues on R81.X platform. Just like the move from SecurePlatform to Gaia OS a few years ago. Running on R81.X is like walking on egg shells in production.

Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

_Val_ — Fri, 22 Jul 2022 07:13:10 GMT

Did you push policies on them as well, after the upgrade?

Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

Edward_Waithaka — Fri, 22 Jul 2022 07:22:26 GMT

Yes, we pushed several times.

Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

Gera_Dorfman — Tue, 26 Jul 2022 18:42:36 GMT

Dear Edward

I am sorry to hear that you had a bad experience with R81.10. We are investigating the reason for VSs to be down. As of the network interfaces firmware we moved towards auto-updates in future versions of the jumbo and major versions;

In general we get a great feedback about R81.10 quality from our partners and customers and we highly recommend upgrading to this version - not only management and regular gateways, but also Maestro environments, and actually the most widely used Maestro version is now R81.10 with the large number of Maestro customers that succesfully upgraded to R81.10.

Thank You

Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

genisis__ — Tue, 26 Jul 2022 20:48:33 GMT

We just did a R80.40 with R81.10 with JHFA66 (inplace upgrade).

We had an odd issue with one node where some wrp interface just appeared in VS0 and that cause a problem with HA.

We ended up removing VSX from the GW and then running vsx_util reconfigure which resolve the problem.

My main issues since upgrade is multicast has stopped working and we have determine the 'fw monitor' causes fwk process crash (yes TAC are engaged) and basically causes an outage.

What other surprises are waiting to be discovered..only time will tell.

b.t.w I am using dynamic balancing and so far not seeing an issue with this, it just works.

Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

Edward_Waithaka — Wed, 27 Jul 2022 06:55:51 GMT

All the best. We just hit another issue today!

Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

Edward_Waithaka — Wed, 27 Jul 2022 07:10:19 GMT

Hi @Markus_Genser

Today we did a fresh install upgrade of our 2nd gateway and we hit another problem.

The VSs on GW2 are complaining that the cluster interface is down hence CCP packets are not being received. This causes the VS to failover in some instances. We are sure the bond interface is fine as everything was working well in the multiversion cluster (R80.20 + R81.10). This just started with the final upgrade of the 2nd GW.

The sync bond interface within the VS is UP in one direction only as seen in the output below.

[Expert@CP-GW2:6]# cphaprob -a if vsid 6: ------ CCP mode: Manual (Unicast) Required interfaces: 8 Required secured interfaces: 1 Interface Name: Status: bond0 (S-LS) Inbound: UP Outbound: DOWN (1245.8 secs) [Expert@CP-GW2:6]# cphaprob stat Active PNOTEs: LPRB, IAC Last member state change event: Event Code: CLUS-110305 State change: ACTIVE -> ACTIVE(!) Reason for state change: Interface bond0 is down (Cluster Control Protocol packets are not received)

TAC have tried to analyze interface packet details to no avail. We will try a full reboot later on.

My advise is for customers to work with R80.40 for VSX based on the number of issues being faced. I believe that version is fairly mature and has been out there for long.

Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

Gera_Dorfman — Wed, 27 Jul 2022 07:20:16 GMT

Hi @Edward_Waithaka

May I suggest that customer success engineer from R&D organization will work with you to review the environment and to help and resolve R81.10 issues that you're having ?

Thanks

Gera

Re: Check Point VSX Cluster - Virtual Systems down after upgrade from R80.20 to R81.10

Edward_Waithaka — Wed, 27 Jul 2022 07:35:32 GMT

The ccp probe issue seems to clear on its own sometimes, then comes back later on.

[Expert@CP-GW2:2]# cphaprob stat Cluster Mode: Virtual System Load Sharing (Primary Up) ID Unique Address Assigned Load State Name 1 10.10.100.169 0% STANDBY CP-GW-1 2 (local) 10.10.100.170 100% ACTIVE CP-GW-2 Active PNOTEs: None Last member state change event: Event Code: CLUS-114904 State change: ACTIVE(!) -> ACTIVE Reason for state change: Reason for ACTIVE! alert has been resolved Event time: Wed Jul 27 09:43:58 2022 Last cluster failover event: Transition to new ACTIVE: Member 1 -> Member 2 Reason: Available on member 1 Event time: Thu Jul 21 00:26:52 2022 Cluster failover count: Failover counter: 1 Time of counter reset: Thu Jul 21 00:17:24 2022 (reboot) [Expert@CP-GW2:2]# cat $FWDIR/log/fwk.elg | grep '27 Jul 9:43:' [27 Jul 9:43:58][fw4_0];[vs_2];CLUS-120207-2: Local probing has started on interface: bond0 [27 Jul 9:43:58][fw4_0];[vs_2];CLUS-120207-2: Local probing has started on interface: bond2.xx [27 Jul 9:43:58][fw4_0];[vs_2];CLUS-120207-2: Local probing has stopped on interface: bond2.xx [27 Jul 9:43:58][fw4_0];[vs_2];CLUS-120207-2: Local Probing PNOTE OFF [27 Jul 9:43:58][fw4_0];[vs_2];CLUS-114904-2: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved