Intro
Hey everyone,
I'm not very familiar with the nomenclature or internal workflows of the Checkpoint Maestro Hyperscale solution, but we're investigating an elusive issue with a particular traffic flow. I've provided a basic diagram to explain the connections.
Example Fabric
Example Path Through Fabric (leaf switch and spine switch selection are irrelevant)
Diagram Overview
There are two firewalls, each connected directly to a single Maestro switch. The Maestro switch is configured with two bridge groups. Traffic should come in from a firewall, enter the Maestro switch, pass through the Checkpoint IPS, which is also attached to the switch, and exit the South-side interfaces to the leaf switches.
The leaf switch pairs each have their own distinct port channels connected to the Maestro switch. The leaf switches connect to a spine layer (I've simplified the connectivity so you don't have to look at all of the redundant connections between the leaf and spine Clos architecture).
Problem
Let's call everything on the left side "side A" and everything on the right side "side B" for simplicity's sake.
If a host behind firewall A, or firewall A itself, tries to communicate with firewall B, or a host behind firewall B, there is significant delay and jitter.
If a host behind firewall A communicates anywhere else in the network, even another host connected on switch pair B that isn't beyond the Maestro switch, there is no issue at all.
I've provided a second copy of the diagram with a red line to illustrate where things fall down. It doesn't matter whether the traffic crosses switch 1 or 2 in pair A or B, or any of the 3 spine switches; the result is always the same.
We have sub-second latency between switch pair A and B. All other inter-leaf pair communications in the fabric work as expected.
My limited understanding of the Maestro switch is that when slave interfaces are assigned to a bridge, layer 2 traffic passively traverses the bridge from North to South and can't communicate with another bridge. I don't understand how traffic exits the bridge to get to the IPS, but it appears either bridge can fork traffic to the attached IPS.
When we do a packet capture from a SPAN on our leaf switches, we see tons of TCP retransmits and out-of-order packets. For example, Host A tries to start a TCP three-way handshake and sends a SYN across the wire. Host B doesn't receive the SYN for longer than expected, which causes many retransmits; it finally receives the SYN and replies with a SYN-ACK. Host A then doesn't receive the SYN-ACK, so Host B starts retransmitting until an ACK is finally seen. Even after the handshake completes, the issue persists through the entire connection.
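To put numbers on the handshake delay described above, a small probe like the following could be run from a host on each side. This is only a sketch: `time_tcp_handshakes` is a hypothetical helper name, and the target host and port are placeholders you would replace with a listener on the far side. On the client, `connect()` returns once the SYN-ACK has arrived, so the measured time includes any SYN or SYN-ACK retransmissions.

```python
import socket
import time
from typing import List, Optional


def time_tcp_handshakes(host: str, port: int, attempts: int = 10,
                        timeout: float = 5.0) -> List[Optional[float]]:
    """Return connect() durations in milliseconds.

    None marks an attempt that failed or timed out.
    """
    results: List[Optional[float]] = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            # create_connection() blocks until the handshake completes,
            # so retransmitted SYNs show up as extra elapsed time.
            with socket.create_connection((host, port), timeout=timeout):
                results.append((time.perf_counter() - start) * 1000.0)
        except OSError:
            results.append(None)
    return results


if __name__ == "__main__":
    # Placeholder target (TEST-NET address); replace with a real
    # listener behind the opposing firewall.
    for ms in time_tcp_handshakes("192.0.2.10", 443, attempts=5):
        print("failed" if ms is None else f"{ms:.1f} ms")
```

Running the same probe against a target that works fine (for example, a host on the opposing leaf pair) and one that doesn't gives a like-for-like comparison. Handshake times that cluster around 1 s, 3 s, and 7 s would typically point at lost SYNs being retransmitted on the OS backoff timer rather than at steady added latency.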
What We've Tried
TCP/UDP connection from host behind Firewall A or B to remote firewall in another data center. Result: Works great
TCP/UDP connection from host behind Firewall A or B across WAN. Result: Works great
TCP/UDP connection initiated from the Maestro-facing interface on either Firewall A or B, terminating directly on the Maestro-facing interface on the opposing firewall. Result: Bad
TCP/UDP connection from host behind Firewall A or B to the Maestro-facing interface on the opposing firewall. Result: Bad
TCP/UDP connection from host behind Firewall A or B to another host behind the opposing firewall. Result: Bad
TCP/UDP connection from host behind Firewall A or B to the Maestro-facing interface on the locally connected firewall. Result: Works great
Disabling IPS policy enforcement temporarily for troubleshooting (although traffic may still pass through the IPS despite the policies being turned off?). Result: Issue still occurs
Disabling firewall inspection policies related to TCP/IP-based connections (including on both firewalls at the same time). Result: Issue still occurs
TCP/UDP connection originating from Switch Pair A or B to the opposing Switch Pair across the fabric. Result: Works great
TCP/UDP connection originating from Switch Pair A or B to the opposing firewall across the fabric. Result: Works great
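One way to read the results above is to count how many times each test path crosses the Maestro switch. The sketch below does that. The crossing counts are my own assumption from the diagram description, not something confirmed on the device: a path from behind a firewall out to the fabric crosses once, and a path from one firewall side to the other crosses twice. `Probe` and `crossings_by_outcome` are illustrative names, and the two policy-disable tests are left out because they change configuration rather than path.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Probe:
    description: str
    maestro_crossings: int  # ASSUMED from the diagram, not measured
    works: bool


PROBES = [
    Probe("host behind FW -> remote firewall in another DC", 1, True),
    Probe("host behind FW -> across WAN", 1, True),
    Probe("FW Maestro-facing iface -> opposing FW iface", 2, False),
    Probe("host behind FW -> opposing FW iface", 2, False),
    Probe("host behind FW -> host behind opposing FW", 2, False),
    Probe("host behind FW -> local FW Maestro-facing iface", 0, True),
    Probe("leaf pair -> opposing leaf pair", 0, True),
    Probe("leaf pair -> opposing firewall", 1, True),
]


def crossings_by_outcome(probes: List[Probe]) -> Tuple[List[int], List[int]]:
    """Return (crossing counts seen in working tests, counts seen in failing tests)."""
    ok = sorted({p.maestro_crossings for p in probes if p.works})
    bad = sorted({p.maestro_crossings for p in probes if not p.works})
    return ok, bad


if __name__ == "__main__":
    ok, bad = crossings_by_outcome(PROBES)
    print("working tests cross the Maestro", ok, "times")
    print("failing tests cross the Maestro", bad, "times")
```

Under those assumptions, every failing test crosses the Maestro twice and no working test does, which is consistent with (but does not prove) the double-traversal theory raised in the question below.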
Questions
I read somewhere in a Checkpoint forum post that traffic passing through the same Maestro twice can cause issues. Is anyone aware of any limitations or bugs in a setup like this? The Maestro switch connections are meant to be passive, and as such we only see the firewalls' MAC addresses advertised across them, but our LACP peering is with the Checkpoint MACs. Each distinct switch pair sees a unique MAC for its LACP peer. Any ideas?