Hi CheckMates
Wondering if anyone has had a similar experience to ours. We are upgrading two 23500 appliances running in VSX mode on R80.10.
We successfully upgraded both appliances to R80.30 using an in-place upgrade via CPUSE. Everything seemed fine; however, if we reboot one member (it doesn't matter which one), we observe states like DOWN and READY for multiple VSes, and this obviously causes impact.
The duration of this state varies but can go from 10 seconds to 30 seconds. In the end, everything recovers and the cluster becomes fully operational.
We have tried the following (and more)
Note that cpstop; cpstart does not result in the same issue; that results in a proper failover and failback! The only workaround (during reboot) so far is setting the two parameters below. We have no idea why they are needed in our R80.30 configuration.
fwha_dead_timeout_multiplier=12
fwha_timer_cpha_res=12
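For reference, this is roughly how we applied them, both on the fly and persistently via fwkern.conf (values as in our setup):
fw ctl set int fwha_dead_timeout_multiplier 12    # apply immediately on the running kernel
fw ctl set int fwha_timer_cpha_res 12
echo 'fwha_dead_timeout_multiplier=12' >> $FWDIR/boot/modules/fwkern.conf    # make it survive reboot
echo 'fwha_timer_cpha_res=12' >> $FWDIR/boot/modules/fwkern.conf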
Anyone have any advice or experience?
Hi,
Can you please confirm some items:
- what is the portfast mode of all connected switch ports (edge)?
- is the sync port configured as a bond?
Hi
- 'spanning-tree port type edge trunk'
- No, it uses the native 'Sync' interface
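For completeness, the relevant switch-port configuration looks roughly like this (NX-OS syntax assumed; the interface name is just an example):
interface Ethernet1/1
  switchport mode trunk
  spanning-tree port type edge trunk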
Please note that reverting back to R80.10, the issue is resolved.
Bump. Anyone?
Just to confirm:
- you have done full fresh install + vsx_util_reconfigure on both nodes?
- this does not affect active box - only rebooted node shows various VS states?
- what do cphaprob stat and cphaprob -a if say on the particular VSes, i.e. what problem do they report? (example commands right after this list)
- you are not observing packet loss between boxes on sync traffic?
- do you use virtual switches or routers?
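For the per-VS checks, something along these lines should do (VS 1 is just an example):
vsenv 1          # switch the shell context to VS 1
cphaprob stat    # cluster member states for this VS
cphaprob -a if   # interface / CCP status for this VS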
As far as I remember, we never saw anything like that going from R80.10 to R80.30. But it's been a while; we have been on R80.40 for quite a while now.
- You have done full fresh install + vsx_util_reconfigure on both nodes?
-- Yes
- This does not affect active box - only rebooted node shows various VS states?
-- The state on the active box goes into a DOWN state and the rebooted member always goes into READY
- What do cphaprob stat and cphaprob -a if say on the particular VSes, what problem do they report?
-- 'cphaprob' reflects the actual status, so on the active node it reports DOWN during the reboot of the other node, and the reason for that is IAC (Interface Active Check). It reports that multiple interfaces are down. The gist of it is: inbound is UP but outbound is DOWN. (See the commands after this list.)
- You are not observing packet loss between boxes on sync traffic?
-- We are not observing any packet loss anywhere.
- Do you use virtual switches or routers?
-- Yes we use virtual switches
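For reference, we collected the state and pnote info on the active member while the peer was rebooting, roughly like this (the VS ID is just an example):
vsenv 2
cphaprob stat       # member states as seen from this VS
cphaprob -ia list   # all pnotes, including the one reported as the reason for DOWN
cphaprob -a if      # per-interface inbound/outbound status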
I would comb through fwk.elg files (both VS0 and other VSes) as they have full history of clustering state changes and possible causes. TAC case as suggested by Val sounds reasonable if you are stuck 🙂
A TAC case was already logged but little to no progress was made as to the root cause of this problem. I just wanted to hear if anyone on CheckMates has had similar experiences.
Did you read sk43872? It has quite a bit of info regarding the kernel parameters you changed.
The SK doesn't explain why we needed those parameters in R80.30, whilst the cluster just worked fine on version R80.10. If there is a valid technical reason as to why these are needed, we are happy to hear it.
ClusterXL (and also SecureXL & CoreXL) changed drastically between these two versions, which could be a "valid technical reason" why clustering parameters would need to change between them.
However, a remaining cluster member should not go from Active to Down during a reboot of the second member. I would ask TAC to concentrate on this symptom.
The second cluster member coming up as Ready is normal, in my view. It cannot be anything else before full sync is completed, and there is no Active member to request it from. Crack why the other guy is Down, and you solve the problem.
I agree with this statement. Still, two weeks in and not much has progressed.
Not to ruffle feathers, Val, but I have never seen, under normal circumstances, a cluster member entering the READY state apart from an upgrade when members are running different versions (HW and/or SW). As far as I have seen, it goes DOWN > INIT > STANDBY (or ACTIVE if it is a higher-priority member with the corresponding cluster setting).
There is a fairly set list of cases that will trigger the READY state on VSX (sk42096):
There are cluster members with a lower software version on this subnet / VLAN
[member with higher software version will go into state 'Ready'].
The number of CoreXL FireWall instances on cluster members is different
[member with greater number of CoreXL FW instances will go into state 'Ready'].
Note: This applies only to R80.10 and lower versions.
The ID numbers of CoreXL FireWall instances and handling CPU core numbers on cluster members are different.
On Gaia OS - Linux kernels on cluster members are different (32-bit vs 64-bit)
[member with higher kernel edition will go into state 'Ready'].
On Gaia OS - Cluster member runs in VSX mode, while other members run in Gateway mode
[member in VSX mode will go into state 'Ready'].
Also checked fwk.elg history on my VSX and did not see a single READY state there apart from upgrade 🙂
grep CLUS $FWDIR/log/fwk.elg* | grep "State change"|grep READY
[3 Jul 17:09:11][fw4_0];[vs_0];CLUS-115303-1: State change: DOWN -> READY | Reason: Member with older software release has been detected
[3 Jul 17:15:54][fw4_0];[vs_0];CLUS-115303-1: State change: INIT -> READY | Reason: Member with older software release has been detected
[3 Jul 17:36:37][fw4_0];[vs_0];CLUS-115303-1: State change: INIT -> READY | Reason: Member with older software release has been detected
[3 Jul 17:55:28][fw4_0];[vs_0];CLUS-112100-1: State change: READY -> DOWN | Reason: FULLSYNC PNOTE
@Vincent_Croes - I hope you have verified that the CoreXL allocations on both members are identical, and have also looked at the fwk.elg logs; they might give a hint as to why a member enters the READY state.
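For comparing the allocations, something like this on both members should be enough:
fw ctl affinity -l -a   # interface and instance-to-core affinity for all VSes
fw ctl multik stat      # number of CoreXL FW instances and their CPUs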
@Vincent_Croes if you needed the command 🙂 this will show any of VS1-9, not VS0:
grep CLUS /var/log/opt/CPsuite-R80.30/fw1/CTX/CTX0000?/fwk.elg*|grep "State change"
Thank you.
I haven't seen the READY state except for upgrade scenarios.
None of the fwk.elg files mention the READY state, and as for the DOWN state, they mention interfaces being down (same output as cphaprob -a if) because its buddy is being rebooted. However, IMO that is not a reason to go DOWN; that is a reason to go ACTIVE ATTENTION.
It kind of looks like when it cannot receive CCP packets from its buddy, it switches to the DOWN state for its VSes.
I will say it for the third and last time: READY should not even be seen under normal circumstances, as we all agree 🙂 However, your situation is not normal. The second member remains in the READY state because it cannot request full sync.
Forget about READY; look into the DOWN one, it is the key.
We are both correct.
READY means the member cannot initialise delta sync. The reasons are: different versions, unmatched CoreXL, and, as I mentioned, full sync not yet done.
You actually can see it in your own log, the last line:
[3 Jul 17:55:28][fw4_0];[vs_0];CLUS-112100-1: State change: READY -> DOWN | Reason: FULLSYNC PNOTE
In a fully operational cluster that READY state is too short to notice. READY -> full sync request -> DOWN -> sync complete -> STANDBY, this is how the normal cycle looks. But if there is no ACTIVE member, the booting cluster member remains READY, as there is nowhere to send the full sync request.
What other changes, if any, were made during/post upgrade? e.g.
- CoreXL
- HT / SMT
- Dynamic Dispatcher
- Multi-queue
- CoreXL
-- Has been modified: we moved a non-MQ interface (Mgmt) to a different core.
- HT / SMT
-- Hasn't been modified.
- Dynamic Dispatcher
-- In R80.10 VSX, we didn't have the Dynamic Dispatcher. In R80.30, it is activated by default. So coming from R80.10 to R80.30, it is now active (see the check below).
- Multi-queue
-- Hasn't been modified.
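For completeness, this is roughly how we confirmed the Dynamic Dispatcher state (command as we recall it from the R80.x CLI):
fw ctl multik dynamic_dispatching get_mode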
Please open a TAC case for this.