PetterD
Contributor

VS Failover on R81.20 Take 26 - fw_full spike_detective

Hello,

I have a customer with a Check Point VSX VSLS cluster with 2 nodes and 6 VSes.
All VSes are defined to always run on the node at the HQ, while the node at the DC is Standby.

The customer was running R81.10 on 5600 appliances but decided to do a hardware refresh along with a software update.
So we replaced this cluster with an R81.20 Take 26 cluster of 7000 appliances, which is a lot more powerful.

The existing network infrastructure/cabling/switches/ports are still in use; only the Check Point appliances and software versions are new.

After this migration, all 6 VSes keep failing over back and forth between the VSX cluster members, complaining about CCP packets not being received on the Sync and other interfaces.
This happens every 1 to 8 hours, and every time it happens it causes network issues, VPN tunnels to third parties fail, etc.

 

Right before a failover happens we notice a "spike" for the "fw_full" process; however, we don't find anything in the .elg log files:

##############
Dec 11 11:37:30 2023 fw-vsxnode1 spike_detective: spike info: type: thread, thread id: 1281, thread name: fw_full, start time: 11/12/23 11:37:18, spike duration (sec): 11, initial cpu usage: 99, average cpu usage: 99, perf taken: 0
Dec 11 11:37:48 2023 fw-vsxnode1 spike_detective: spike info: type: cpu, cpu core: 2, top consumer: fw_full, start time: 11/12/23 11:37:29, spike duration (sec): 18, initial cpu usage: 95, average cpu usage: 80, perf taken: 0
Dec 11 11:37:48 2023 fw-vsxnode1 spike_detective: spike info: type: thread, thread id: 1281, thread name: fw_full, start time: 11/12/23 11:37:35, spike duration (sec): 12, initial cpu usage: 99, average cpu usage: 80, perf taken: 0

Dec 11 11:38:49 2023 fw-vsxnode1 fwk: CLUS-110305-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface wrp768 is down (Cluster Control Protocol packets are not received)
Dec 11 11:38:51 2023 fw-vsxnode1 fwk: CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved

Dec 11 11:38:54 2023 fw-vsxnode1 fwk: CLUS-110305-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface wrp768 is down (Cluster Control Protocol packets are not received)
Dec 11 11:38:57 2023 fw-vsxnode1 fwk: CLUS-110305-1: State remains: ACTIVE! | Reason: Interface Sync is down (Cluster Control Protocol packets are not received)
##############
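
In case it helps, the basic per-VS cluster checks I know of for this kind of CCP issue are roughly these (sketch only; VS 3 is just an example ID):

####
# Overall VS distribution and state on this member
vsx stat -v

# Switch into the context of one of the affected VSs (VS 3 is just an example)
vsenv 3

# Cluster state, monitored interfaces / CCP status and sync statistics for that VS
cphaprob state
cphaprob -a if
cphaprob syncstat

# Failover history and the spike_detective entries around it
cphaprob show_failover
grep spike_detective /var/log/messages | tail -20
####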

We of course registered a High Priority case with TAC (4+ days ago), but there has not been much help or many replies so far, even with a manager escalation, so we are getting a bit frustrated 🙂

Any hints/tricks on where I should look to investigate this issue further myself?

CCSM / CCSE / CCVS / CCTE
the_rock
Legend
PetterD
Contributor

Thanks for a quick reply! 🙂

The customer is very strict about installing any software, as this cluster controls important national infrastructure, so any third-party tool has to go through a review first. Unfortunately, it will take a while before I can run it 😕

Performance shouldn't be an issue, as we went from 5600 (8 cores) to 7000 (16 cores, 32 virtual), and the issues started immediately after shifting from 5600 R81.10 to 7000 R81.20 Take 26. Even the switch ports are the same as before, so I definitely suspect a bug.

CCSM / CCSE / CCVS / CCTE
the_rock
Legend

I can tell you, based on all my testing in the lab with Jumbo Take 41, it's FANTASTIC! I find traffic is better, CPU/memory as well, and policy push is faster than before.

Definitely something to consider, but you can certainly also work with TAC to get a better understanding of what could be causing this.

Kind regards,

Andy

PetterD
Contributor

I'm really eager to try it out myself. It took us a month to get a maintenance window for this change; I do know I'll be able to get an emergency change for Jumbo patching, but I want to make sure we don't also need a portfix at the same time. Hopefully TAC will reply soon 🙂

CCSM / CCSE / CCVS / CCTE
the_rock
Legend

Let's hope so. I also see that Take 41 is indeed the recommended take at the moment, so it would not surprise me if they suggest it to you.

Best regards,

Andy

PetterD
Contributor


During investigations with TAC we found a strange mismatch in a kernel parameter.

We did not create or modify the fwkern.conf file ourselves, but found the following:


####
[Expert@fw-vsxnode1:0]# cat $FWDIR/boot/modules/fwkern.conf
fwha_enable_state_machine_by_vs=1
fwha_active_standby_bridge_mode=1
fwmultik_sync_processing_enabled=0
####

When checking the "fwmultik_sync_processing_enabled" parameter at runtime, however, it shows as enabled (1), even though fwkern.conf sets it to 0:

####
[Expert@fw-vsxnode1:0]# fw ctl get int fwmultik_sync_processing_enabled
fwmultik_sync_processing_enabled =1
####

The gateways haven't been rebooted since the implementation/reconfiguration process, when they were both rebooted.
I'm not sure why the kernel value differs from fwkern.conf, but the parameter has clearly been added there without being loaded.
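
If I understand the mechanics right (happy to be corrected), the split between the boot-time file and the running kernel value is roughly this (the "set" line is just an illustration, not something we ran on this cluster):

####
# Persistent: values in this file are only applied when the module loads, i.e. after a reboot
cat $FWDIR/boot/modules/fwkern.conf

# Runtime: the value the kernel is actually using right now
fw ctl get int fwmultik_sync_processing_enabled

# Runtime: change the value on the fly - takes effect immediately but is NOT persistent
# (illustration only; we did not set it this way on this cluster)
fw ctl set int fwmultik_sync_processing_enabled 0
####

So a value can be active in the kernel without ever having been written to fwkern.conf, and a value in fwkern.conf will not take effect until the next reboot, which would explain a mismatch like ours.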


sk105762 is also somewhat confusing; under the instructions it says:

"Note: Beginning in Jumbo Hotfix Accumulator R80.40 Take 78 (PRJ-13177) and on all GA versions starting R81, this mode is supported when the Security Gateway is configured in VSX/USFW mode. In a cluster environment, this procedure does not have to be performed on all members of the cluster because it enables monitoring only."

Hoping that the reboot and/or Take 41 will also fix this, but most importantly solve the failover issue without introducing new bugs 🙂

CCSM / CCSE / CCVS / CCTE
the_rock
Legend

I have high hopes, as they say. Take 41 seems super stable, so I'm sure it will help. Just make sure the right values are indeed set in the fwkern.conf file; the sync processing parameter is supposed to equal 1 and not 0, so it should look like this:

[Expert@fw-vsxnode1:0]# cat $FWDIR/boot/modules/fwkern.conf
fwha_enable_state_machine_by_vs=1
fwha_active_standby_bridge_mode=1
fwmultik_sync_processing_enabled=1

Let us know how it goes mate.

Andy

PetterD
Contributor

Thanks! 🙂

Is the "fwmultik_sync_processing_enabled" supposed to equal 1 and not 0?

I checked another R81.20 VSX installation, and there it was 0 both in fwkern.conf and with "fw ctl get".
I also checked an R81.10 VSX installation; there was nothing in fwkern.conf, but "fw ctl get" showed it was set to 0.

On this installation it's set to 0 in fwkern.conf (not by me) but to 1 in the kernel 😄
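
For anyone comparing their own boxes, the two values I am putting side by side on each installation are simply these:

####
# Value staged for the next boot (line may be missing if it was never set persistently)
grep fwmultik_sync_processing_enabled $FWDIR/boot/modules/fwkern.conf

# Value currently active in the kernel
fw ctl get int fwmultik_sync_processing_enabled
####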

CCSM / CCSE / CCVS / CCTE
the_rock
Legend

Sorry, geesh, I am like ChatGPT today, all over the place lol.

You are right, what you sent is correct; it just needs a reboot to update properly. It seems like someone set it on the fly, which would explain the output of the other command.

Andy

PetterD
Contributor

Haha, great, I was getting even more confused here 😄

Nobody else touches this cluster but me, so the parameter has to have been set this way in fwkern.conf automagically, but for some reason it was not loaded correctly. Crossing fingers for tonight 🙂

CCSM / CCSE / CCVS / CCTE
the_rock
Legend

Yes, that makes sense... don't worry mate, the only one confused here is me 😄

I am positive that all will go well tonight.

Andy

PetterD
Contributor

We are now 13 hours after the upgrade to Take 41, have verified that the kernel parameter is set to the same value as in fwkern.conf, and have not had any VS failovers.

We usually had them every 1-2 hours (sometimes up to 5), so so far it's looking very good!

We will probably never know whether it was Take 26 or the kernel parameter that was the root cause, and frankly, after a 24/7 headache for several weeks, I don't really care 😄

We have also had a strange issue with latency in logging that only affected firewalls (non-VSX, including VE and various appliance models) that the SmartCenter manages via these VSes. Logs for these were lagging 2-3 hours behind (making traffic investigations a pain), but due to the VSX/VS issue we didn't have time to look at it. This issue is now also solved. Since we didn't get approval to patch the Management at the same time, the issue must have been traffic related, with the VSes disrupting the log traffic.

Policy push also seems to be noticeably faster. So far it looks like Santa came early this year 🙂
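
For reference, the kind of quick checks I plan to keep repeating to confirm it stays this way (rough sketch only):

####
# Installed Jumbo take (should now show Take 41)
cpinfo -y all

# Kernel value should match fwkern.conf after the reboot
fw ctl get int fwmultik_sync_processing_enabled
cat $FWDIR/boot/modules/fwkern.conf

# No new failover events since the upgrade (checked in each VS context)
cphaprob show_failover
####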

 

CCSM / CCSE / CCVS / CCTE
the_rock
Legend

Awesome news @PetterD, glad it's fixed!

Andy
