CheckMates › Products › Quantum › Security Gateways
VS Failover on R81.20 Take 26 - fw_full spike_detective
Hello,
I have a customer with a Check Point VSX VSLS cluster with 2 nodes and 6 VSes.
All VSes are defined to always run on the node at the HQ while the DC node is Standby.
The customer was running R81.10 on 5600 appliances but decided to do a hardware refresh together with a software update, so we replaced this cluster with an R81.20 Take 26 cluster on 7000 appliances, which is a lot more powerful.
Apart from the new Check Point appliances and software versions, the existing network infrastructure/cabling/switches/ports are unchanged.
After this migration, all 6 VSes are failing back and forth between the VSX cluster members, complaining about CCP packets, the Sync interface, and other interfaces.
This happens every 1-8 hours, and every time it does it causes network issues: VPN tunnels to third parties fail, etc.
Right before a failover happens we notice a "spike" for the "fw_full" process, but we don't find anything in the .elg log files:
##############
Dec 11 11:37:30 2023 fw-vsxnode1 spike_detective: spike info: type: thread, thread id: 1281, thread name: fw_full, start time: 11/12/23 11:37:18, spike duration (sec): 11, initial cpu usage: 99, average cpu usage: 99, perf taken: 0
Dec 11 11:37:48 2023 fw-vsxnode1 spike_detective: spike info: type: cpu, cpu core: 2, top consumer: fw_full, start time: 11/12/23 11:37:29, spike duration (sec): 18, initial cpu usage: 95, average cpu usage: 80, perf taken: 0
Dec 11 11:37:48 2023 fw-vsxnode1 spike_detective: spike info: type: thread, thread id: 1281, thread name: fw_full, start time: 11/12/23 11:37:35, spike duration (sec): 12, initial cpu usage: 99, average cpu usage: 80, perf taken: 0
Dec 11 11:38:49 2023 fw-vsxnode1 fwk: CLUS-110305-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface wrp768 is down (Cluster Control Protocol packets are not received)
Dec 11 11:38:51 2023 fw-vsxnode1 fwk: CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
Dec 11 11:38:54 2023 fw-vsxnode1 fwk: CLUS-110305-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface wrp768 is down (Cluster Control Protocol packets are not received)
Dec 11 11:38:57 2023 fw-vsxnode1 fwk: CLUS-110305-1: State remains: ACTIVE! | Reason: Interface Sync is down (Cluster Control Protocol packets are not received)
##############
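For anyone trying to correlate spike_detective entries like the ones above with the cluster state changes, the timestamps and durations can be pulled out programmatically. A minimal sketch in Python; the field layout is inferred from the log lines above, not from an official spike_detective format spec:

```python
import re
from datetime import datetime

# Matches the "key: value" fields in spike_detective summary lines
# as they appear above (layout inferred from the logs, not documented).
SPIKE_RE = re.compile(
    r"spike info: type: (?P<type>\w+).*?"
    r"start time: (?P<start>[\d/]+ [\d:]+), "
    r"spike duration \(sec\): (?P<dur>\d+)"
)

def parse_spike(line):
    """Return (spike_type, start_datetime, duration_seconds), or None if
    the line is not a spike_detective summary."""
    m = SPIKE_RE.search(line)
    if not m:
        return None
    # spike_detective uses DD/MM/YY for the start time (11/12/23 = Dec 11 2023)
    start = datetime.strptime(m.group("start"), "%d/%m/%y %H:%M:%S")
    return m.group("type"), start, int(m.group("dur"))

line = ("Dec 11 11:37:48 2023 fw-vsxnode1 spike_detective: spike info: "
        "type: cpu, cpu core: 2, top consumer: fw_full, "
        "start time: 11/12/23 11:37:29, spike duration (sec): 18, "
        "initial cpu usage: 95, average cpu usage: 80, perf taken: 0")
print(parse_spike(line))
# → ('cpu', datetime.datetime(2023, 12, 11, 11, 37, 29), 18)
```

Feeding /var/log/messages through this and sorting by start time makes it easy to see whether the fw_full spikes consistently precede the CLUS state changes.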
We of course registered a High Priority case with TAC (4+ days ago), but there has not been much help or many replies so far, even with a manager escalation, so we are getting a bit frustrated 🙂
Any hints/tricks on where I should look to investigate this issue further myself?
Accepted Solutions
We are now 13 hours after the upgrade to Take 41, have verified that the kernel parameter is set to the same value as in fwkern.conf, and have not had any VS failovers.
We usually had them every 1-2 hours, up to 5 hours, so it's looking very good so far!
We will probably never know whether it was Take 26 or the kernel parameter that was the root cause, and frankly, after a 24/7 headache for several weeks, I don't really care 😄
We have also had a strange issue with logging latency that only affected firewalls (non-VSX, including VE and various appliance models) that the SmartCenter manages via these VSes. Logs for these were lagging 2-3 hours behind (making traffic investigations a pain), but due to the VSX/VS issue we didn't have time to look at it. This issue is now also solved. Since we didn't get approval to patch the Management at the same time, it must have been a traffic-related issue, with the VSes breaking log traffic.
Policy push also seems noticeably faster. So far it looks like Santa came early this year 🙂
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Maybe run the script below and send the results; it should help us.
Andy
Thanks for the quick reply! 🙂
The customer is very strict about any software installs, as the cluster controls important national infrastructure, so any third-party tool has to go through review first. Unfortunately it will take a while before I can run it 😕
Performance shouldn't be an issue, as we went from a 5600 (8 cores) to a 7000 (16 cores, 32 virtual), and the issues started immediately after shifting from the 5600 on R81.10 to the 7000 on R81.20 Take 26. Even the switch ports are the same as before, so I definitely suspect a bug.
I can tell you, based on all my testing in the lab with Jumbo Take 41: it's FANTASTIC! I find traffic handling is better, CPU/memory usage as well, and policy push is faster than before.
Definitely something to consider, but you can certainly also work with TAC to get a better understanding of what could be causing this.
Kind regards,
Andy
I'm really eager to try it out myself. It took us a month to get a maintenance window for this change; I do know I'll be able to get an emergency change for Jumbo patching, but I want to make sure we don't also need a portfix at the same time. Hopefully TAC will reply soon 🙂
Let's hope so. I also see that Take 41 is indeed the recommended take at the moment, so it would not surprise me if they suggest it to you.
Best regards,
Andy
During the investigation with TAC we found a strange mismatch in a kernel parameter.
We did not create or modify the fwkern.conf file ourselves, but we find the following:
####
[Expert@fw-vsxnode1:0]# cat $FWDIR/boot/modules/fwkern.conf
fwha_enable_state_machine_by_vs=1
fwha_active_standby_bridge_mode=1
fwmultik_sync_processing_enabled=0
####
When checking the "fwmultik_sync_processing_enabled" parameter in the running kernel, however, it shows as enabled:
####
[Expert@fw-vsxnode1:0]# fw ctl get int fwmultik_sync_processing_enabled
fwmultik_sync_processing_enabled =1
####
The gateways haven't been rebooted since the implementation/reconfiguration process, during which both were rebooted.
Not sure why the kernel parameter differs from fwkern.conf; it has clearly been added there but not loaded.
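A quick way to catch this kind of drift is to compare every parameter in fwkern.conf against what the running kernel reports. A minimal sketch, assuming Python is available in Expert mode; the helper functions are my own names, and only the `fw ctl get int` invocation is the command shown above:

```python
import subprocess

def parse_fwkern(text):
    """Parse fwkern.conf-style 'name=value' lines into a dict of ints."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        values[name.strip()] = int(value)
    return values

def live_value(name):
    """Query the running kernel via 'fw ctl get int' (Check Point CLI).
    Its output looks like: 'fwmultik_sync_processing_enabled =1'."""
    out = subprocess.run(["fw", "ctl", "get", "int", name],
                         capture_output=True, text=True).stdout
    return int(out.partition("=")[2])

def find_drift(conf_text, get_live=live_value):
    """Return {name: (conf_value, live_value)} for mismatched parameters."""
    conf = parse_fwkern(conf_text)
    return {n: (v, get_live(n)) for n, v in conf.items() if get_live(n) != v}

# Demo with the values from this thread, live values injected for testing:
conf = """fwha_enable_state_machine_by_vs=1
fwha_active_standby_bridge_mode=1
fwmultik_sync_processing_enabled=0"""
live = {"fwha_enable_state_machine_by_vs": 1,
        "fwha_active_standby_bridge_mode": 1,
        "fwmultik_sync_processing_enabled": 1}
print(find_drift(conf, live.get))
# → {'fwmultik_sync_processing_enabled': (0, 1)}
```

On a real gateway you would pass the contents of $FWDIR/boot/modules/fwkern.conf and let `find_drift` use the default `live_value` query; any non-empty result means a reboot (or manual `fw ctl set`) is needed to bring the kernel in line with the file.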
sk105762 is also somewhat confusing; under Instructions it says:
"Note: Beginning in Jumbo Hotfix Accumulator R80.40 Take 78 (PRJ-13177) and on all GA versions starting R81, this mode is supported when the Security Gateway is configured in VSX/USFW mode. In a cluster environment, this procedure does not have to be performed on all members of the cluster because it enables monitoring only"
Hoping that the reboot and/or Take 41 will also fix this, but most importantly solve the failover issue without introducing new bugs 🙂
I have high hopes, as they say. Take 41 seems super stable, so I'm sure it will help. Just make sure the right values are indeed set in the fwkern.conf file; the sync processing parameter is supposed to equal 1, not 0, so it should look like this:
[Expert@fw-vsxnode1:0]# cat $FWDIR/boot/modules/fwkern.conf
fwha_enable_state_machine_by_vs=1
fwha_active_standby_bridge_mode=1
fwmultik_sync_processing_enabled=1
Let us know how it goes, mate.
Andy
Thanks! 🙂
Is "fwmultik_sync_processing_enabled" supposed to equal 1 and not 0?
I checked another R81.20 VSX installation, and there it was 0 both in fwkern.conf and in the "fw ctl get" output.
I also checked an R81.10 VSX installation; there fwkern.conf contained nothing, but "fw ctl get" showed it set to 0.
On this installation it is set to 0 in fwkern.conf (not by me) but to 1 in the kernel 😄
Sorry, geesh, I am like ChatGPT today, all over the place lol.
You are right, what you sent is correct; it just needs a reboot to load properly. It seems like someone set it on the fly, which would explain the output of the other command.
Andy
Haha, great, I was getting even more confused here 😄
Nobody touches this cluster but me, so the parameter must have been set this way in fwkern.conf automagically, but for some reason it was not loaded correctly. Crossing fingers for tonight 🙂
Yes, that makes sense... don't worry mate, the only one confused here is me 😄
I am positive that all will go well tonight.
Andy
Awesome news @PetterD, glad it's fixed!
Andy
