Running through a VSX cluster upgrade in a lab to validate implementation plans and provide some confidence in the process.
Having unexpected failovers during reconfiguration of a new R80.40 cluster member, after the old (77.30) member is stopped.
The process is slightly unorthodox, as we're introducing new hardware with the upgrade and moving some VLANs from a standalone NIC to port channels, but it has worked successfully in the past (so not a case for change_interfaces here). The new hardware is assigned the current management IPs, allowing the customer to keep everything consistent, including management traffic that traverses other gateways. VSLS is enabled, all systems are active on M1, and VSLS mode is Active Up at the start of the change (preemption should always be disabled as good practice).
1. After the first new device (M2) is reconfigured, all Virtual Systems are stuck with InitialPolicy, and fw fetch fails from each VS (I believe this has something to do with masters and SIC names, which look like the CMA was once a standalone SMS with a different name). I've had to install policy manually from the CMA (or use a policy-install preset from the MDS).
Is it possible that, because the CMA was once an SMS, the SIC names changed and prevented the gateway from fetching or installing policy? After the manual install, MVC works fine and we're able to fail virtual systems over from the 77.30 member without an outage. At this point all VSs are active on M2, MVC is disabled, and the old M1 is taken offline (re-IP, cpstop, shut down dataplane ports, etc.).
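For reference, this is roughly how I've been retrying the fetch per VS and inspecting the SIC certificate DNs on the management for the legacy SMS name. A sketch only; the VS ID, CMA IP, and CMA name are placeholders for my lab values:

```shell
# On the gateway: retry a policy fetch in the context of one VS
# (VS ID 2 and the CMA IP below are placeholders).
vsenv 2
fw fetch 192.0.2.10

# On the MDS, in the context of the relevant CMA (name is a
# placeholder): list valid SIC certificates so the DN, which may
# still carry the old SMS name, can be inspected.
mdsenv cma-name
cpca_client lscert -stat Valid -kind SIC
```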
2. During the second new device (M1) reconfiguration, some VSs manage to get policy, the HA module starts, and some become active on M1 despite VSLS Active Up and the M1 reconfiguration being incomplete.
While watching cphaprob stat, we can see the VSs on M2 going DOWN momentarily before returning to STANDBY, with the message "Incorrect configuration - Local cluster member has fewer cluster interfaces configured compared to other cluster member(s)" and an IAC pnote.
This doesn't make any sense, as each VS has (or should have) an identical number of cluster interfaces. Is there a reconfig subprocess that can change this and affect ClusterXL?
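To confirm the counts really do match, I've been comparing the per-VS interface view on both members and diffing the output. A sketch, with placeholder VS IDs:

```shell
# Run on each cluster member and compare the output between members.
# The per-VS "Required interfaces" count reported here is what the
# IAC pnote is complaining about. VS IDs below are placeholders.
for vs in 1 2 3; do
    vsenv $vs
    echo "== VS $vs =="
    cphaprob -a if
done
```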
3. Once the reconfiguration on M1 is complete and the member is rebooted, we still have one VS with no policy. Installation from the console fails with "Peer SIC Certificate has been revoked try to reset SIC on the peer and re-establish the trust".
Unfortunately sk174046 did not resolve this (and the article is full of errors). sk34098 did, but only once the certificate was pulled from the VS. I can't think of any reason why this VS was unique, other than the legacy SMS name in the certificates.
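In case it helps others, one management-side way to reset SIC for a single VS is vsx_util reset_sic; it's interactive, so there are no flags to get wrong. A sketch (the CMA name is a placeholder, and this is an alternative to the sk34098 route rather than what the article describes):

```shell
# On the management server (for an MDS, first enter the relevant
# CMA context; the name below is a placeholder).
mdsenv cma-name
vsx_util reset_sic
# Interactive prompts follow: management IP, admin credentials,
# the VSX object / Virtual System, and a new one-time activation key.
```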
4. After the M1 reboot and the broken VS recovered, all virtual systems are failed over from M2 to M1 using a clusterXL_admin for loop. This works for all except about three VSs, which immediately fail back to M2, and the VSX gateway itself.
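The failover loop itself is nothing special; roughly the following, run on M2 (VS IDs are placeholders for the real range):

```shell
# On M2: administratively bring each VS down so it fails over to M1.
# VS IDs are placeholders.
for vs in 1 2 3; do
    vsenv $vs
    clusterXL_admin down
done
# After verifying state with cphaprob stat, bring each VS back up so
# M2 returns to STANDBY rather than staying administratively down.
for vs in 1 2 3; do
    vsenv $vs
    clusterXL_admin up
done
```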
On M1, we can see those VSs transition between DOWN and STANDBY about every 5 seconds; when down, the message is "Interface wrpXYZ is down (Cluster Control Protocol packets are not received)". But I don't see any CCP traffic on the interfaces of other VSs that ClusterXL reports as UP either...
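To look for CCP on the wire directly, I've been capturing on the flapping interface from within the affected VS context; CCP runs over UDP port 8116. A sketch (the VS ID is a placeholder, and wrpXYZ stands in for the interface named in the pnote):

```shell
# From the context of the affected VS (ID is a placeholder), capture
# Cluster Control Protocol traffic (UDP 8116) on the flapping
# interface reported by the pnote.
vsenv 5
tcpdump -nni wrpXYZ udp port 8116
```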