Sync Bond issue during VSX upgrade

xiro · ‎2019-12-29

Hi,

I'm currently trying to upgrade our VSX environment (fortunately not yet in production) from R80.20 to R80.30 via "Connectivity Upgrade".

Unfortunately, I ran into an issue that is causing me some pain, and I don't know how to proceed.

The situation is as follows:

Both VSX Gateways are connected via a Sync bond (bond2: two direct cables running between them, no switches involved).

I followed the instructions in the "Installation and Upgrade Guide R80.30" for "Connectivity Upgrade of a VSX Cluster" up to step 4, where I upgraded the standby member to R80.30 via CPUSE in clish. At that point, I realised that the status of the members was not as expected.

As far as I understand, the primary member should stay "ACTIVE", whereas the upgraded one should go into the "READY" state.

In my case, they seem to have lost the sync between them, so both sides are now active:

Member 1 (not upgraded):

Member 2 (upgraded):

If I run "cphaprob -a if" on the members, I see some strange behavior. On member 1, the status is constantly transitioning between UP and DOWN:

If you repeat the command at short intervals, you see the timer going up to 5 seconds, then the status suddenly changes to the following:

And the next iteration is "DOWN" again.

On the other (upgraded) member, the status is constantly "Inbound: UP - Outbound: DOWN".
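
For anyone who wants to capture this flapping without retyping the command, here is a minimal sketch of a polling helper. It is only an illustration: the names `run_command` and `watch_changes` are hypothetical, the 1-second interval is an assumption, and whether Python is available where you would run it is left open. Because the real output contains a running timer, you would probably want to pass a `normalize` function that strips it. None is provided here, since the exact output format is not reproduced in this thread.

```python
import subprocess
import time
from typing import Callable, List, Tuple


def run_command(cmd: List[str]) -> str:
    """Run a command and return its standard output as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout


def watch_changes(
    fetch: Callable[[], str],
    iterations: int,
    interval_s: float,
    normalize: Callable[[str], str] = lambda text: text,
) -> List[Tuple[int, str]]:
    """Poll fetch() repeatedly and record (poll index, output) whenever
    the normalized output differs from the previous poll."""
    changes: List[Tuple[int, str]] = []
    previous = None
    for index in range(iterations):
        current = normalize(fetch())
        if current != previous:
            changes.append((index, current))
            previous = current
        if index < iterations - 1:
            time.sleep(interval_s)
    return changes


# Hypothetical usage on a cluster member (interval and count are assumptions):
#   for index, output in watch_changes(
#           lambda: run_command(["cphaprob", "-a", "if"]), 20, 1.0):
#       print(f"--- change at poll {index} ---")
#       print(output)
```

The fetch function is injected so the loop can be tried out with canned strings before pointing it at a real member.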

The cabling was left untouched, the bond config seems OK on both sides.

I'm not sure how to proceed further. I intended this as a test of the connectivity upgrade before everything goes into production, but in this case it failed completely...

Any help is appreciated 🙂

Maarten_Sjouw · ‎2019-12-29

The first thing to check is what the cluster sync is set to on both members: multicast, broadcast, unicast or auto. Just make sure both are set to the same mode.
As long as this is not in production, I would make member 2 the only active member by running cpstop on member 1, and then continue the upgrade there.
I have seen similar issues with ClusterXL on VSX before, where a clean-installed member took over from the active member during the vsx_util reconfigure.

Regards, Maarten

xiro · ‎2019-12-29

Thanks Maarten,

I've checked that on both sides and configured both of them to broadcast, but that didn't resolve the issue.

I then rebooted member1 just out of frustration.

Now the status on member2 is "READY" and "cphaprob -a if" shows bond2 constantly UP, but on member1 it is constantly "Inbound UP - Outbound DOWN".

I then found sk65560 describing all the possible causes and solutions, but none of them seems plausible:

Cause

Physical/Logical connectivity issue due to one of the following:

~~Bad switch configuration (factors such as: Speed, Duplex, Flow Control, etc).~~
~~Bad network cable.~~
~~Bad switch port (if it is a copper port, verify that there are no bent or missing pins in the socket).~~
~~High latency on switch (switch might be under heavy load or have poor connection).~~ -> No switches involved
Bad port on appliance (if it is a copper port, verify that there are no bent or missing pins in the socket). -> not likely, before starting the upgrade everything was fine.
Subnet mis-match between cluster members on the interface shown to have the issue. -> No, both are in the same subnet: 192.168.191.1 and 192.168.191.2
Mismatch in monitor mode - monitor mode is not supported in ClusterXL -> monitor mode not in use
Anti-Spoofing is not configured correctly. -> For Sync interface Anti-Spoofing isn't configurable (at least from SC)
IGMP Membership issues (often occurs with Nexus switches) -> no switches used, direct connection
Network Adapter LAN segment (when working with Virtual Machines) are mismatched between cluster members. -> physical appliances
Cluster ID is already in use -> only CP firewalls that we have in use, directly connected, they can't see another cluster, even if one was there...
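
The subnet point can be double-checked offline. The sketch below assumes the two Sync addresses are 192.168.191.1 and 192.168.191.2; the prefix length is not stated in this thread, so /24 is purely an assumption, and `same_subnet` is a hypothetical helper name. It uses only Python's standard `ipaddress` module.

```python
import ipaddress


def same_subnet(addr_a: str, addr_b: str, prefix_len: int) -> bool:
    """Return True if both addresses fall into the same network
    for the given prefix length."""
    net_a = ipaddress.ip_interface(f"{addr_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{addr_b}/{prefix_len}").network
    return net_a == net_b


# Prefix length 24 is an assumption; substitute the mask configured on bond2.
print(same_subnet("192.168.191.1", "192.168.191.2", 24))
```

The answer depends on the mask actually configured on each member, so it is worth running the check with the real value from both sides rather than relying on the addresses alone.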

I've also checked our logging, and there's something suspicious there:

But I'm not sure what that means.
It's originating from member1.

Regarding your suggestion to carry on:
This will become a 24/7 production environment, which is why the CU feature is very important to me. I would like to find the cause of this issue; otherwise we may run into the same problem at the next update. Right now I can take the time to troubleshoot, which won't be that easy later.

Maarten_Sjouw · ‎2019-12-29

Setting CCP to broadcast is the worst option of them all; always use unicast if possible.

If I were really trying to find the issue, I would start all over again: begin with a clean install of R80.20 with the latest Jumbo, do a vsx_util reconfigure, then start the migration again and see what happens.

Regards, Maarten
