Re: VSX Upgrade manual policy installs, unforced V...

cosmos · ‎2023-03-15

Running through a VSX cluster upgrade in a lab to validate implementation plans and provide some confidence in the process.

Having unexpected failovers during reconfiguration of a new R80.40 cluster member, after the old (77.30) member is stopped.

The process is slightly unorthodox, as we're introducing new hardware with the upgrade, and moving some VLANs from a standalone NIC to port channels but has worked successfully in the past (not a case for change_interfaces here). The new hardware gets assigned the current management IPs allowing the customer to keep everything consistent, including management traffic that traverses other gateways. While VSLS is enabled, all systems are active on M1 and VSLS mode is Active Up at the start of the change (pre-empt should always be disabled as good practice).

1. After the first new device (M2) reconfiguration, all Virtual Systems are stuck with InitialPolicy and fw fetch fails from each VS (I believe this has something to do with masters and SIC names which look like the CMA was once a standalone SMS with different name). I've had to manually install policy from the CMA (or policy install preset from the MDS)

Is it possible that because the CMA was once an SMS, that SIC names have changed and prevented the gateway from fetching / policy installing? After manual install, MVC works fine and we're able to failover virtual systems from the 77.30 member without outage. At this point all VS are active on M2, MVC disabled, old M1 taken offline (re-IP, cpstop, shutdown dataplane ports etc.)

2. During second new device (M1) reconfig, some VSs manage to get policy, the HA module starts and some become active on M1 despite VSLS active up and M1 reconfiguration incomplete.

While watching cphaprob stat, we can see the VS on M2 going DOWN momentarily before STANDBY with the message "Incorrect configuration - Local cluster member has fewer cluster interfaces configured compared to other cluster member(s)" with IAC pnote.

This doesn't make any sense, as each VS has an identical number of cluster interfaces (should have) - is there a reconfig subprocess that can change this and affect cXL?

3. Once reconfig on M1 is complete and rebooted, we still have 1 VS with no policy. Installation from the console fails with "Peer SIC Certificate has been revoked try to reset SIC on the peer and re-establish the trust".

Unfortunately sk174046 did not resolve this, and the article is full of errors. sk34098 did, but only when the cert was pulled from the VS. I can't think of any reason why this VS was unique, other than the legacy SMS name in the certificates.

4. After the reboot of M1 and broken VS recovered, all virtual systems are failed over from M2 to M1 using a clusterXL_admin for loop. This works for all except about 3 VS which immediately fail back to M2, and the VSX gateway itself.

On M1, we can see those VS transition between DOWN and STANDBY about every 5 seconds, when down the message is "Interface wrpXYZ is down (Cluster Control Protocol packets are not received)". But I don't see any CCP traffic on interfaces on other VSs reported UP by clusterXL either...

cosmos · ‎2023-03-15

What a novel. After rebooting M1 again, the problem persisted and was affecting VS0 wrp1 more than any other VS. Seems to have stabilised after about 20 minutes (???)

cosmos · ‎2023-03-15

Spoke too soon, still happening on VS0:

vsid 0:
------
CCP mode: Manual (Unicast)
Required interfaces: 3
Required secured interfaces: 1

Interface Name: Status:

Sync (S) UP
Mgmt UP
wrp1 Inbound: DOWN (4.2 secs)
Outbound: DOWN (3.9 secs)

S - sync, LM - link monitor, HA/LS - bond type

PhoneBoy · ‎2023-03-16

You must do a policy installation from the management after upgrading a gateway to a new version.
This is because the previously installed policy was compiled for a different version than the gateway is currently running.

emmap · ‎2023-03-16

When you do a vsx_util upgrade, the management server pre-compiles the policies, so that the VSX gateway can fetch them, all prepared for the new version, when it boots up. So if there's no policy on the VSX/VS after the reboot, something has prevented this fetch from working. How does VS0 route to the management server?

SIC should work, even with the change in mgmt servers, but it sounds like something's a bit messy in there.

Why does VS0 have a wrp link? I'm assuming this is a vswitch, what other interfaces are on this vswitch? Does it have an 'external' link (something out an actual interface) or is it all wrp links?

cosmos · ‎2023-03-19

@emmap wrote:
When you do a vsx_util upgrade, the management server pre-compiles the policies, so that the VSX gateway can fetch them, all prepared for the new version, when it boots up.

I was hoping this was the case, either here or during the reconfiguration process as it takes long enough. What @PhoneBoy describes seems to align with my experience though, as manual policy install wasn't required on M1 post reconfiguration except for the 1 VS. It's as if the manual policy install still performs a necessary step that the upgrade or reconfigure doesn't.

So if there's no policy on the VSX/VS after the reboot, something has prevented this fetch from working. How does VS0 route to the management server?

VS0 (Mgmt) > (VSID 11 > VS0 (Mgmt) on another R80.40 VSX cluster) > MDS. I inherited this mess, have some choice words for those that built it.

SIC should work, even with the change in mgmt servers, but it sounds like something's a bit messy in there.

SIC does appear to work, at least for all but the 1 VS. I've been here before and it wasn't pretty. In this case trust is now established for each VS but fw fetch was failing which I assume is the same process reconfigure uses to pull policy from each VS.

Why does VS0 have a wrp link? I'm assuming this is a vswitch, what other interfaces are on this vswitch? Does it have an 'external' link (something out an actual interface) or is it all wrp links?

Your guess is as good as mine, probably the same admin. The unexpected failovers I think are a separate issue from the policy install, I can live with additional deployment steps but the failovers are concerning - should we be seeing CCP traffic on the wrp links?

emmap · ‎2023-03-19

Yea that sounds like some sub-optimal architecture going on, especially that mgmt routing via another VS. That's a great way to end up with an unrecoverable VSX cluster.

FW fetch I think is used during VSX reconfigure and also during a reboot, so if that's having problems when run manually then that's likely the same issue as the failure to fetch during the upgrade.

I think when there's a vswitch the CCPs are performed on the 'external' interface (the not wrp link) that's required to be there. I'm a bit rusty on the specifics there. Basically the vswitch needs to be able to CCP to its counterpart on the other cluster member.

cosmos · ‎2023-03-20

I think when there's a vswitch the CCPs are performed on the 'external' interface (the not wrp link) that's required to be there. I'm a bit rusty on the specifics there. Basically the vswitch needs to be able to CCP to its counterpart on the other cluster member.

So the CCP traffic we saw on wrp links were rx broadcasts, nothing is actually sent from the wrp link only the 'external' interface but using the wrp link MAC address. Our initial thoughts were the VLAN of the vswitch external interface was not defined on the ESXi vSwitch (like you would on a physical switch else the tagged frame is discarded), however adding it doesn't make any difference and we have since found Virtual Guest Tagging (VGT) does not require it (otherwise all monitored interfaces would be down anyway).

The Virtual Systems that fail (wrp link tx/rx down) don't appear to be consistent either - we can move a VS back to the original member and another VS will fail for a second and become active on the other member. It's like whack-a-mole and I'm wondering if this is also happening in prod (R77.30) just not noticed because VSLS is in primary up...

cosmos · ‎2023-03-21

Still trying to understand why cxl in a VS complains about tx/rx down on a wrp interface, when CCP should be on the physical interface only.

Have also noticed the following:

One of the bond interfaces is reporting slowly incrementing rx drops
1. None of the members report rx drops, may be cosmetic & unrelated
2. BOTH clusters report rx drops on the same bond, when another bond does not
@;309957;[vs_22];[tid_0];[fw4_0];fw_log_drop_ex: Packet proto=-1 ?:0 -> ?:0 dropped by fwha_select_arp_packet Reason: arp src is from local;
1. On practically all VSeses
@;306369;[vs_11];[tid_0];[fw4_0];fw_log_drop_ex: Packet proto=17 169.254.0.33:8116 -> 169.254.0.32:8116 dropped by fw_antispoof_log Reason: Address spoofing;
1. On exactly 2 VS, there is another cluster with interfaces in the same VLAN so most likely that except the internal IPs don't match interfaces on the shared VLANs... not a rabbit hole worth exploring
@;309044;[vs_24];[tid_0];[fw4_0];fw_log_drop_ex: Packet proto=-1 ?:0 -> ?:0 dropped by fwha_select_arp_packet Reason: CPHA replies to arp;
1. On about 5 VS, not sure what's special about those or what to look for based on such an informative message

Even if we untangle the management spaghetti and remove the circular dependencies, this ~~cluster~~ ouroboros keeps performing the same circus act in ESXi and I want to be sure it won't in production.

the_rock · ‎2023-03-19

What is the current state? Is policy still not loading properly AND you are having failover issues as well?

cosmos · ‎2023-03-20

Policy is now loading since install from CMA and revoking / renewing the one VS SIC cert. Failover issues were still present last night in the lab. Hoping I can isolate that to the lab environment (ESXi) today.

Are you a member of CheckMates?

VSX Upgrade manual policy installs, unforced VS failovers