Re: Upgrading 3000 cluster to 5000 cluster

Stuart_Green · ‎2021-02-02

Hi,

So we have a functioning HA pair of 3200s at a branch office. Everything works fine.

We need to replace these with an HA pair of 5200s so the 3200s can be rotated out to a different office.

I’ve built the 5200s from the base image (R80.40) and addressed the interfaces identically to the 3200 cluster, including MAC addresses.

Reset SIC in SD, get topology, model and OS from firewalls then push policy.

New cluster establishes and traffic flows out of the LAN down the MPLS.

You’d think that would be fine and dandy but it isn’t.

No TCP sessions establish from the LAN. Can see first packets arriving at the Internet breakout at the perimeter. 5200s can get updates and access everywhere internally that they should. Never any sessions from the LAN. Firewall rules unchanged. Logging on origin firewalls identical to originals.

ARP caches flushed on all switches and routers at the branch office. Still no sessions.

ClusterXL fine. No errors in any logs. Licenses all fine.

Plug the old cluster back in and everything works.

Any ideas?

TIA

G_W_Albrecht · ‎2021-02-02

TAC.

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

Maarten_Sjouw · ‎2021-02-02

One thing is bugging me when I read your story: "including MAC addresses." Why? Use arping to let all the surrounding devices know that the MAC has changed. Configuring vMAC on your cluster object is always a good idea, and when you arping, do not forget to do it for the VIP as well.

Have you also thought about clearing the ARP cache on the Internet router? Keep in mind that Cisco routers have a default ARP cache timeout of 4 hours!

Gratuitous ARP to force a new MAC address towards the router:

Enable binding to non-local IP addresses on-the-fly (addresses not directly assigned to an interface):
cat /proc/sys/net/ipv4/ip_nonlocal_bind
echo 1 > /proc/sys/net/ipv4/ip_nonlocal_bind ## use 0=off, 1=on
cat /proc/sys/net/ipv4/ip_nonlocal_bind

arping -c 4 -A -I eth3 1.2.3.4
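
To cover both the member address and the VIP in one go, the arping line above could be wrapped in a small loop. This is only a sketch: the function name send_garp, the interface and the 192.0.2.x addresses are placeholders, and it defaults to a dry run that just prints the commands. Announcing the VIP from a member assumes ip_nonlocal_bind has been enabled as shown above.

```shell
#!/bin/bash
# Sketch: announce several addresses on one interface via gratuitous ARP.
# The interface and addresses are placeholders, not values from this thread.
# DRY_RUN=1 (the default here) only prints the commands instead of running them.

send_garp() {
    local iface="$1"; shift
    local ip
    for ip in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "arping -c 4 -A -I $iface $ip"
        else
            arping -c 4 -A -I "$iface" "$ip"
        fi
    done
}

# Example: physical member address first, then the cluster VIP.
send_garp eth3 192.0.2.2 192.0.2.1
```

Run it with DRY_RUN=0 as root once the printed commands look right.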

Regards, Maarten

Stuart_Green · ‎2021-02-02

Thanks Maarten,

I included the bit about MAC addresses because I think this is ARP related but don’t have access to the Cisco config.

Do you happen to know if the ARP table survives a reboot on a Cisco?

How would you arping for the VIP if using a vMAC?

TIA

Gareth_somers · ‎2021-02-02

Hi Stuart,

The ARP cache is volatile on Cisco, so a reboot should clear it.

Are you by any chance adding one of the 5000s into the cluster with an existing 3000, and are they on a different version (R80.30 or older)? If you are trying to swap one box into the existing cluster and they are not on the same version, you might need to enable MVC. What you are describing is a lot like that behavior, where the cluster won't pass traffic and it looks like ARP. On the 5000 use: set cluster member mvc on to enable it.

Regards,

Gareth

Stuart_Green · ‎2021-02-03

Thanks Gareth.

We thought that about the Ciscos but you do start challenging your own thinking at times!

The 5000s aren't replacing the 3000s one at a time but more of a lift and shift. 5000s were configured exactly the same as the 3000s and then plugged in, SIC reset, Cluster object Hardware upgraded, Topology fetched and then pushed policy.

All of the traffic flowing through the 5000-based cluster is the same as the 3000-based cluster but with the exception of any TCP sessions.

Gareth_somers · ‎2021-02-03

If it was me, I'd verify the LAN side on the Cisco: show ip arp <ip address of cluster int> to make sure the entry looks right, then show mac address-table | inc <mac address> to make sure the MAC is showing on the correct interface. If there is a problem on that side, in the case of VRRP you might have a conflict with another service, so extended VMAC or moving to a dedicated subnet for the uplinks would solve that. Also check for static ARP entries (show run | inc arp).

Assuming that all looks right, I'd run a capture on the cluster interfaces using tcpdump -i <int> ether host <cluster mac address> and see if traffic is hitting the firewalls. Then start looking at the firewall (AntiSpoofing, routing, SecureXL issue, fw monitor captures etc.)
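
Since the symptom is TCP sessions never establishing, the capture could be narrowed to SYN-ACK replies only. A rough sketch that just builds the tcpdump command line: the helper name synack_capture_cmd, the interface and the MAC are placeholders (the MAC is from the documentation range), and the filter uses the standard pcap tcpflags syntax.

```shell
#!/bin/bash
# Sketch: build a tcpdump command that shows only SYN-ACK segments
# sent to or from a given MAC. Interface and MAC below are placeholders.

synack_capture_cmd() {
    local iface="$1" mac="$2"
    local filter="ether host $mac and tcp[tcpflags] & (tcp-syn|tcp-ack) == (tcp-syn|tcp-ack)"
    # -nn: no name or port resolution, -e: print link-level (MAC) headers.
    printf "tcpdump -nn -e -i %s '%s'\n" "$iface" "$filter"
}

synack_capture_cmd eth1 00:00:5e:00:53:01
```

If SYNs leave on the outside interface but no SYN-ACK ever shows up in a capture like this, the replies are being sent somewhere else, which points at ARP or routing upstream rather than at the firewall policy.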

Best of luck with it, sounds like a nightmare.

Maarten_Sjouw · ‎2021-02-02

As Gareth mentioned, the ARP entries will not survive a reboot of a router.

From the active gateway you just arping the VIPs; it will just send the real MACs. But when you do the swap one by one, you should not have the issue at all when using vMACs, as they would not even change.

Regards, Maarten

Stuart_Green · ‎2021-02-03

Thanks Maarten,

Will try the vMAC route as there's nothing else to suggest anything other than ARP dramas. All of the virtual interfaces are up across all physical and VLANs. No errors anywhere.

TIA

Stuart_Green · ‎2021-02-06

Tried vMAC. Still the same.

Bob_Zimmerman · ‎2021-02-03

When in place, were the new firewalls able to ping things in the LAN and/or were things in the LAN able to ping or connect to the firewalls themselves? If so, the problem may just be IP forwarding (i.e. routing). It can be disabled in certain circumstances, which would cause through-traffic to fail like how you describe.
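
A quick way to check the forwarding state from expert mode, as a minimal sketch. It assumes the standard Linux sysctl path applies on the gateway; the function name forwarding_state is made up here, and the optional argument exists only so the check can be pointed at a test file.

```shell
#!/bin/bash
# Sketch: report whether the Linux kernel is forwarding IPv4 packets.
# Assumes the usual /proc path; pass another path only for testing.

forwarding_state() {
    local path="${1:-/proc/sys/net/ipv4/ip_forward}"
    if [ "$(cat "$path" 2>/dev/null)" = "1" ]; then
        echo "forwarding enabled"
    else
        echo "forwarding disabled or unreadable"
    fi
}

forwarding_state
```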

Stuart_Green · ‎2021-02-06

Yep, all the pings work as you say and we can see forwarding happening with the packets arriving at the perimeter.

This looks like the return traffic isn't coming to the cluster address.

the_rock · ‎2021-02-03

Ok, I see lots of responses already, but just asking about basics, in case something was missed..

When this breaks, what do the logs show? Did you do a simple debug on the firewall to see why this is not working? tcpdump, fw monitor?

Best,
Andy

Stuart_Green · ‎2021-02-06

Logs are reporting back to the management as normal except only seeing connections rather than sessions. Perimeter firewall is seeing the connections so forwarding does appear to be happening.

the_rock · ‎2021-02-06

Happy to do remote and see if I can help you fix it.

Best,
Andy

Stuart_Green · ‎2021-02-07

Thanks for the offer, but I found the problem: static routes had been replaced with OSPF as an undocumented change. Stood up OSPF on the new cluster and problem solved.
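
For anyone hitting the same thing, a line-by-line comparison of saved configuration dumps from an old and a new member would likely have surfaced the missing routing config sooner. A minimal sketch, assuming both dumps are plain text files (for example the output of show configuration saved from each box); the function name missing_lines and the file names are hypothetical.

```shell
#!/bin/bash
# Sketch: print lines present in the old member's config dump but
# absent from the new member's dump. File names are hypothetical.

missing_lines() {
    local old="$1" new="$2"
    # -F fixed strings, -x whole-line match, -v invert:
    # keep lines of $old that match no line in $new.
    grep -Fxv -f "$new" "$old"
}

# Usage (with dumps saved beforehand):
#   missing_lines old-3200.cfg new-5200.cfg | grep -i ospf
```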
