Hello,
I have a customer with a Check Point VSX VSLS cluster with 2 nodes and 6 VSes.
All VSes are configured to always run on the node at the HQ, while the node at the DC is Standby.
The customer was running R81.10 on 5600 appliances but decided to do a hardware refresh together with a software upgrade.
So we replaced this cluster with an R81.20 T26 cluster on 7000 appliances, which are a lot more powerful.
The existing network infrastructure (cabling, switches, ports) is still in use; only the Check Point appliances and software versions are new.
After this migration, all 6 VSes are failing over back and forth between the VSX cluster nodes, complaining about CCP packets, Sync and other interfaces.
This happens at irregular intervals (anywhere from 1 to 8 hours), and every time it happens it causes network issues, e.g. the VPN tunnel to a third party fails.
Right before a failover happens we notice a "spike" for the "fw_full" process, but we don't find anything in the .elg log files:
##############
Dec 11 11:37:30 2023 fw-vsxnode1 spike_detective: spike info: type: thread, thread id: 1281, thread name: fw_full, start time: 11/12/23 11:37:18, spike duration (sec): 11, initial cpu usage: 99, average cpu usage: 99, perf taken: 0
Dec 11 11:37:48 2023 fw-vsxnode1 spike_detective: spike info: type: cpu, cpu core: 2, top consumer: fw_full, start time: 11/12/23 11:37:29, spike duration (sec): 18, initial cpu usage: 95, average cpu usage: 80, perf taken: 0
Dec 11 11:37:48 2023 fw-vsxnode1 spike_detective: spike info: type: thread, thread id: 1281, thread name: fw_full, start time: 11/12/23 11:37:35, spike duration (sec): 12, initial cpu usage: 99, average cpu usage: 80, perf taken: 0
Dec 11 11:38:49 2023 fw-vsxnode1 fwk: CLUS-110305-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface wrp768 is down (Cluster Control Protocol packets are not received)
Dec 11 11:38:51 2023 fw-vsxnode1 fwk: CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
Dec 11 11:38:54 2023 fw-vsxnode1 fwk: CLUS-110305-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface wrp768 is down (Cluster Control Protocol packets are not received)
Dec 11 11:38:57 2023 fw-vsxnode1 fwk: CLUS-110305-1: State remains: ACTIVE! | Reason: Interface Sync is down (Cluster Control Protocol packets are not received)
##############
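To check whether every state change really is preceded by a spike, the two kinds of entries can be lined up by timestamp. Below is a minimal sketch of that, assuming the log lines keep the exact prefix format shown above (month, day, time, year, host, process). The function names `parse_line` and `correlate` and the 120-second window are my own illustrative choices, not anything from Check Point.

```python
import re
from datetime import datetime, timedelta

# Prefix format taken from the excerpt above, e.g.
# "Dec 11 11:37:48 2023 fw-vsxnode1 spike_detective: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"(?P<host>\S+) (?P<proc>[^:]+): (?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, process, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # Collapse repeated spaces so single-digit days ("Dec  1") parse as well.
    ts_text = " ".join(m.group("ts").split())
    ts = datetime.strptime(ts_text, "%b %d %H:%M:%S %Y")
    return ts, m.group("proc"), m.group("msg")


def correlate(lines, window_sec=120):
    """For each cluster state message, list the spikes logged in the window before it.

    window_sec is a hypothetical default; adjust it to taste.
    """
    spikes, changes = [], []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines that do not carry the expected prefix
        ts, proc, msg = parsed
        if proc == "spike_detective":
            spikes.append((ts, msg))
        elif proc == "fwk" and "CLUS-" in msg:
            changes.append((ts, msg))

    window = timedelta(seconds=window_sec)
    result = []
    for change_ts, change_msg in changes:
        preceding = [s for s in spikes if timedelta(0) <= change_ts - s[0] <= window]
        result.append((change_ts, change_msg, preceding))
    return result
```

Feeding it the lines of the messages file (for example `correlate(open(path))`, where `path` is wherever these entries are logged on the node) gives one tuple per CLUS message with the spikes seen shortly before it. A state change with an empty spike list would suggest the spike is not the only trigger.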
We of course opened a High Priority case with TAC (4+ days ago), but we have not received much help or many replies so far, even after a Manager Escalation, so we are getting a bit frustrated 🙂
Any hints or tricks on where I should look to investigate this issue further myself?
CCSM / CCSE / CCVS / CCTE