After upgrade from R80.10 to R80.30 - Virtual Systems go into DOWN-READY state if one member reboots

Vincent_Croes · ‎2021-11-12

Hi CheckMates

Wondering if anyone has had a similar experience to ours. We are upgrading two 23500 appliances running in VSX mode on R80.10.

We succeeded in upgrading both appliances to R80.30 using an in-place upgrade via CPUSE. Everything seems fine; however, if we reboot one member (it doesn't matter which one), we observe states like DOWN-READY for multiple VSes, and this obviously causes impact.

The duration of this state varies from 10 to 30 seconds. In the end, everything recovers and the cluster becomes fully operational.

We have tried the following (and more)

A full R80.30 install + reconfigure results in the same issue
Limiting the kernel parameters to the bare minimum
Playing with the CCP clustering method (uni, broad and multicast)
Changing L2 equipment (both units are connected to a different single switch)
Connecting the Sync interface link-local to the other node
Checking the CPU / memory load
Checking for issues on the NICs / cables
Installing the latest JHF

Note that cpstop; cpstart does not trigger the issue: it results in a proper failover and failback. The only workaround so far (for the reboot case) is setting the two parameters below. We have no idea why they are needed in our R80.30 configuration.

fwha_dead_timeout_multiplier=12
fwha_timer_cpha_res=12
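
The thread does not say how these two values were applied. A minimal sketch, assuming they are persisted one per line as name=value in $FWDIR/boot/modules/fwkern.conf (the usual place for boot-time kernel parameters on Gaia; treat the path as an assumption for your build), that reports which of the two are present in a given file. The helper name check_fwha_params is hypothetical.

```shell
#!/bin/bash
# Report whether the two ClusterXL timing parameters from this thread are
# set in a fwkern.conf-style file (one name=value pair per line).
# Assumption: the default path below; pass another file as $1 to override.
check_fwha_params() {
    local conf="${1:-${FWDIR:-}/boot/modules/fwkern.conf}"
    local param value
    for param in fwha_dead_timeout_multiplier fwha_timer_cpha_res; do
        # Extract the value of the last assignment found for this parameter.
        value=$(sed -n "s/^[[:space:]]*${param}[[:space:]]*=[[:space:]]*\([^[:space:]]*\).*/\1/p" "$conf" 2>/dev/null | tail -n 1)
        if [ -n "$value" ]; then
            echo "${param}=${value}"
        else
            echo "${param} not set"
        fi
    done
}
```

Run check_fwha_params with no argument on each member to confirm both carry the same values after a reboot. The file only shows what is persisted; fw ctl get int <parameter> can be used to read the value currently loaded.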

Anyone have any advice or experience?

Chris_Atkinson · ‎2021-11-12

Hi,

Can you please confirm some items:

- what is the portfast mode of all connected switch ports (edge)?

- is the sync port configured as a bond?

CCSM R77/R80/ELITE

Vincent_Croes · ‎2021-11-12

Hi

- 'spanning-tree port type edge trunk'

- No, it uses the native 'Sync' interface

Please note that reverting back to R80.10, the issue is resolved.

Vincent_Croes · ‎2021-11-15

Bump. Anyone?

Kaspars_Zibarts · ‎2021-11-15

Just to confirm:

- you have done full fresh install + vsx_util_reconfigure on both nodes?

- this does not affect active box - only rebooted node shows various VS states?

- what do cphaprob stat and cphaprob -a if say on particular VSes, and what problem do they report?

- you are not observing packet loss between boxes on sync traffic?

- do you use virtual switches or routers?

As far as I remember, we never saw anything like that going from R80.10 to R80.30. But it's been a while; we have been on R80.40 for some time now.

Vincent_Croes · ‎2021-11-15

- You have done full fresh install + vsx_util_reconfigure on both nodes?

-- Yes

- This does not affect active box - only rebooted node shows various VS states?

-- The state on the active box goes into a DOWN state and the rebooted member always goes into READY

- What do cphaprob stat and cphaprob -a if say on particular VSes, and what problem do they report?

-- 'cphaprob' reflects the actual status: on the active node it reports DOWN while the other node reboots, and the reason given is IAC. It reports that multiple interfaces are down. The gist of it is that Inbound is UP but Outbound is DOWN.

- You are not observing packet loss between boxes on sync traffic?

-- We are not observing packet loss between anything.

- Do you use virtual switches or routers?

-- Yes we use virtual switches

Kaspars_Zibarts · ‎2021-11-16

I would comb through fwk.elg files (both VS0 and other VSes) as they have full history of clustering state changes and possible causes. TAC case as suggested by Val sounds reasonable if you are stuck 🙂

Vincent_Croes · ‎2021-11-16

A TAC case was already logged, but little to no progress has been made on the root cause of this problem. I just wanted to hear whether anyone on CheckMates had had a similar experience.

Kaspars_Zibarts · ‎2021-11-16

Did you read sk43872? It has quite a bit of information on the kernel parameters you changed.

Vincent_Croes · ‎2021-11-17

The SK doesn't explain why we needed those parameters in R80.30, whilst the cluster just worked fine on version R80.10. If there is a valid technical reason as to why these are needed, we are happy to hear it.

_Val_ · ‎2021-11-17

ClusterXL (and also SecureXL & CoreXL) have been changed drastically between these two versions, which could be a "valid technical reason" that clustering parameters would be changed between the versions.

However, a remaining cluster member should not go from Active to Down during reboot of the second member. I would ask TAC to concentrate on this symptom.

The second cluster member coming up as Ready is normal, in my view. It cannot be anything else before full sync is completed, and there is no Active member to request it from. Work out why the other member is Down, and you solve the problem.

Vincent_Croes · ‎2021-11-17

I agree with this statement. Still, two weeks in and not much has progressed.

Kaspars_Zibarts · ‎2021-11-17

Not to ruffle feathers, Val, but I have never seen a cluster member enter the READY state under normal circumstances, apart from upgrades when members are running different versions (HW and/or SW). As far as I have seen, it goes DOWN > INIT > STANDBY (or ACTIVE if it's a higher-priority member with the corresponding cluster setting).

There is a fairly fixed list of cases that trigger the READY state on VSX (sk42096):

There are cluster members with a lower software version on this subnet / VLAN
[member with higher software version will go into state 'Ready'].

The number of CoreXL FireWall instances on cluster members is different
[member with greater number of CoreXL FW instances will go into state 'Ready'].

Note: This applies only to R80.10 and lower versions.

The ID numbers of CoreXL FireWall instances and handling CPU core numbers on cluster members are different.

On Gaia OS - Linux kernels on cluster members are different (32-bit vs 64-bit)
[member with higher kernel edition will go into state 'Ready'].

On Gaia OS - Cluster member runs in VSX mode, while other members run in Gateway mode
[member in VSX mode will go into state 'Ready'].

Also checked fwk.elg history on my VSX and did not see a single READY state there apart from upgrade 🙂

grep CLUS $FWDIR/log/fwk.elg* | grep "State change"|grep READY
[3 Jul 17:09:11][fw4_0];[vs_0];CLUS-115303-1: State change: DOWN -> READY | Reason: Member with older software release has been detected
[3 Jul 17:15:54][fw4_0];[vs_0];CLUS-115303-1: State change: INIT -> READY | Reason: Member with older software release has been detected
[3 Jul 17:36:37][fw4_0];[vs_0];CLUS-115303-1: State change: INIT -> READY | Reason: Member with older software release has been detected
[3 Jul 17:55:28][fw4_0];[vs_0];CLUS-112100-1: State change: READY -> DOWN | Reason: FULLSYNC PNOTE
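
When the log history is long, it can help to collapse the matches into counts per transition and reason. A rough sketch, assuming every line follows the format of the sample above ("State change: FROM -> TO | Reason: TEXT"); summarise_state_changes is a hypothetical helper name, not a Check Point tool.

```shell
#!/bin/bash
# Summarise ClusterXL state transitions from fwk.elg-style lines on stdin.
# Assumed line format (taken from the sample output in this post):
#   [date][fwN_M];[vs_X];CLUS-nnnnnn-n: State change: FROM -> TO | Reason: TEXT
summarise_state_changes() {
    awk '
        /State change: / {
            line = $0
            sub(/^.*State change: /, "", line)   # keep "FROM -> TO | Reason: TEXT"
            sep = " | Reason: "
            i = index(line, sep)
            trans  = (i ? substr(line, 1, i - 1) : line)
            reason = (i ? substr(line, i + length(sep)) : "unknown")
            count[trans " (" reason ")"]++
        }
        END {
            for (key in count) printf "%d\t%s\n", count[key], key
        }
    ' | LC_ALL=C sort -k1,1nr -k2
}
```

Usage: grep CLUS $FWDIR/log/fwk.elg* | summarise_state_changes. Each output line is a count, a tab, then the transition with its reason in parentheses, most frequent first.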

@Vincent_Croes - I hope you have verified that the CoreXL allocations on both members are identical, and also looked at the fwk.elg logs? They might give a hint as to why a member enters the READY state.
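
One way to check that the allocations really match is to save the same listing on each member and diff the two files. A small sketch, assuming the output was captured on each member with something like fw ctl affinity -l > /var/tmp/affinity_member1.txt (the file names and the helper compare_members are hypothetical; use whichever CoreXL listing you normally rely on).

```shell
#!/bin/bash
# Compare two saved command outputs, one per cluster member.
# Prints "identical" when they match; otherwise prints a unified diff
# and returns 1.
compare_members() {
    local a="$1" b="$2"
    local out
    if out=$(diff -u "$a" "$b"); then
        echo "identical"
    else
        printf '%s\n' "$out"
        return 1
    fi
}
```

Usage: copy both files to one machine and run compare_members /var/tmp/affinity_member1.txt /var/tmp/affinity_member2.txt. Any line in the diff is a difference worth explaining before looking further.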

Kaspars_Zibarts · ‎2021-11-17

@Vincent_Croes if you needed the command 🙂 this will show any VS1-9, not VS0

grep CLUS /var/log/opt/CPsuite-R80.30/fw1/CTX/CTX0000?/fwk.elg*|grep "State change"
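
The ? in that glob matches a single character, so VS IDs of 10 and above are skipped. A sketch of a loop that covers every context directory and prints a per-VS count of state changes, assuming the same directory layout as the path above; count_vs_state_changes is a hypothetical helper name.

```shell
#!/bin/bash
# Count "State change" lines per Virtual System context directory.
# Assumption: per-VS logs live under <base>/CTX<id>/fwk.elg*, as in the
# path quoted above; pass a different base directory as $1 if needed.
count_vs_state_changes() {
    local base="${1:-/var/log/opt/CPsuite-R80.30/fw1/CTX}"
    local dir n
    for dir in "$base"/CTX[0-9]*; do
        [ -d "$dir" ] || continue
        # Read the current and rotated logs for this VS; grep -c prints 0
        # (and exits non-zero) when nothing matches, hence the "|| true".
        n=$(cat "$dir"/fwk.elg* 2>/dev/null | grep -c 'State change' || true)
        echo "${dir##*/} $n"
    done
}
```

A VS with a much higher count than its neighbours is a good place to start reading the full log.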

Vincent_Croes · ‎2021-11-18

Thank you.

Vincent_Croes · ‎2021-11-18

I haven't seen the READY state except in upgrade scenarios.

None of the fwk.elg files mention the READY state. As for the DOWN state, the logs mention interfaces being down (same output as cphaprob -a if) because the other member is being rebooted. However, IMO, that is not a reason to go DOWN; that is a reason to go ACTIVE ATTENTION.

It looks as if a member switches its VSes to the DOWN state when it cannot receive CCP packets from its peer.

_Val_ · ‎2021-11-18

I will say it the third and last time. READY should not be even seen under normal circumstances, as we all agree 🙂 However, your situation is not normal. The second member remains in READY state because it cannot request full sync.

Forget about READY, look into DOWN one, it is the key.

_Val_ · ‎2021-11-18

We are both correct.

Ready means cluster cannot initialise delta sync. The reasons are: different versions, unmatched CoreXL, and as I mentioned, full sync is not yet done.

You actually can see it in your own log, the last line:

[3 Jul 17:55:28][fw4_0];[vs_0];CLUS-112100-1: State change: READY -> DOWN | Reason: FULLSYNC PNOTE

In a fully operational cluster that READY state is too short to notice. READY -> full sync request -> DOWN -> sync complete -> STANDBY, this is how the normal cycle looks. But if there is no ACTIVE member, the booting cluster member remains READY, as there is nowhere to send full sync request.

Chris_Atkinson · ‎2021-11-15

What other changes were made if any during/post upgrade? e.g.

- CoreXL

- HT / SMT

- Dynamic Dispatcher

- Multi-queue

CCSM R77/R80/ELITE

Vincent_Croes · ‎2021-11-15

- CoreXL

-- Has been modified: moved a non MQ interface (MGMT) to a different core

- HT / SMT

-- Hasn't been modified.

- Dynamic Dispatcher

-- In R80.10 VSX, we didn't have the Dynamic Dispatcher. In R80.30 it is enabled by default, so after the upgrade from R80.10 to R80.30 it is now active.

- Multi-queue

-- Hasn't been modified.

_Val_ · ‎2021-11-15

Please open a TAC case for this.
