Hi all,
I'm having a very weird issue with HA MGMT across 2 sites since installing the latest Take 78. Bear with me, it's hard to word. I've put a good 2 hrs into trying to word it and tabulate the results, and even pulled a colleague in.
Please see network diagram of CP Mgmt:
The primary site and primary MGMT server are at Site 1. The MGMT and ClusterXL members' subnet there is 10.10.171.0/24
The secondary site and secondary MGMT server are at Site 2. The MGMT and ClusterXL members' subnet there is 10.20.171.0/24
The two sites share VLAN 173 and I use that to route between them.
Static route at site 1:
S 10.20.171.0/24 via 172.30.0.4, eth4, cost 0, age 13639
Static route at site 2:
S 10.10.171.0/24 via 172.30.0.1, eth4, cost 0, age 837
Each cluster has a VIP on VLAN 173 (eth4 on all cluster members):
Site 1: 172.30.0.1
Site 2: 172.30.0.4
I've been using Site 1 to manage both sites since install. All green. Pushing policies, sync'ing mgmt servers, etc. Both MGMT servers are using their respective cluster members as the gateway.
As a sanity check, I like to run "curl_cli -k https://some_cp_member" to verify that members and MGMT servers can reach each other and that a TLS connection would succeed.
Today, I pushed Take 78 to both clusters. Since then I'm observing the following:
* Each MGMT server can reach only the active FW cluster member at the opposing site; the standby member is unreachable.
* The opposing MGMT server is always reachable
* At the same time, FW cluster members local to the MGMT server being tested from can reach both the active & standby FW member at the opposing site (10.x.171.2 and 10.x.171.3 in the table below)
* So can a Linux test host on the same network as the MGMT server (10.x.171.10 in the table below)
In other words, the problem seems to only be affecting the MGMT server!
Here is a text snippet example of what I was doing, followed by a spreadsheet with results:
* From Primary (Site1) MGMT server when FW1-Site2 is Active:
  * curl_cli -k https://10.20.171.2 (fw1-site2)
  * curl_cli -k https://10.20.171.3 (fw2-site2) DOES NOT WORK!
  * curl_cli -k https://10.20.171.5 (mgmt-site2)
Or flip firewalls at Site 2:
* From Primary (Site1) MGMT server when FW2-Site2 is Active:
  * curl_cli -k https://10.20.171.2 (fw1-site2) DOES NOT WORK!
  * curl_cli -k https://10.20.171.3 (fw2-site2)
  * curl_cli -k https://10.20.171.5 (mgmt-site2)
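The manual checks above could be wrapped in a small loop so that every target is tested the same way each time the active member flips. This is only a rough sketch: `check_target`, `run_checks`, `CURL_BIN` and `PORT` are names made up for illustration, the 10-second limit is arbitrary, and it assumes curl_cli accepts the standard curl options `-s`, `-o` and `--max-time`.

```shell
#!/bin/bash
# Sketch only: check_target, run_checks, CURL_BIN and PORT are illustrative names.
CURL_BIN="${CURL_BIN:-curl_cli}"   # curl_cli on Gaia; plain curl on a Linux test host
PORT="${PORT:-8443}"               # assumed from the note that the Gaia web UI is on 8443

# Attempt one HTTPS request to a host and print OK or FAIL.
check_target() {
    local ip="$1" label="$2"
    # -k: skip certificate validation, -s: silent, -o /dev/null: discard the body,
    # --max-time 10: abort the whole attempt after 10 seconds instead of hanging
    if "$CURL_BIN" -k -s -o /dev/null --max-time 10 "https://${ip}:${PORT}/"; then
        echo "OK   ${ip} (${label})"
    else
        echo "FAIL ${ip} (${label})"
    fi
}

# Targets as seen from the Site 1 MGMT server (Site 2 addresses from this post).
run_checks() {
    check_target 10.20.171.2 fw1-site2
    check_target 10.20.171.3 fw2-site2
    check_target 10.20.171.5 mgmt-site2
}
```

Sourcing the file and calling `run_checks` before and after a failover gives two directly comparable result sets; setting `CURL_BIN=curl` lets the same sketch run from the Linux test host.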
I did concurrent pcaps on the MGMT server and on the cluster member that I'm unable to reach. The FW cluster member receives the TLS Client Hello and responds with a Server Hello. Normal stuff.
However, the MGMT server pcap shows the Client Hello going out and nothing coming back.
(Note: I'm testing port 8443 simply because the Gaia web UI is hosted there; 443 on the gateways is taken by SSL VPN. I'm using the Gaia web UI as a check that I can connect and negotiate TLS.)
This is super confusing because if I perform the same curl_cli from cluster members or a standalone linux host at the same site as the MGMT server, it works FINE.
As you can see in the screenshot, packets arrive at the destination but response packets are NOT returned to the MGMT server.
They ARE returned if I try the same curl from the MGMT server's gateway (i.e. the cluster members local to it) or from another test host in the same network as the MGMT server. ?!?!!??!?!?!
I've triple-checked the route tables and the MTU settings on all interfaces.
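For anyone who wants to check the path MTU end to end rather than per interface, a ping with the don't-fragment bit set is one option. This is a sketch under assumptions: `df_payload` is a helper name invented here, and the example in the comment assumes a Linux iputils-style ping that supports `-M do`, which may not match every Gaia build.

```shell
# Sketch only: df_payload is an illustrative helper, not a Gaia command.
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
df_payload() {
    echo $(( $1 - 28 ))
}

# Example (assumes iputils ping; -M do sets don't-fragment, -c 3 sends three probes):
# ping -M do -c 3 -s "$(df_payload 1500)" 10.20.171.3
```

If the full-size probe fails while a smaller one gets through, something on the path has a lower MTU than the interfaces report.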
Help? Did I just discover a bug?
My colleague points out that these line items appear in the "fixed" list of resolved issues and they "sound" related to me:
* PRJ-38820, MBS-14060 (ClusterXL): Local connection from the Management interface on a non-standard port (e.g. 8000) may fail.
* PRJ-37883, PMTR-81375 (ClusterXL): Local connection from a Standby member may fail when packets are not fragmented even if the interface MTU is smaller than the packet size.
After writing this post, I uninstalled Take 78 from all cluster members (which puts them back at Take 66). The MGMT servers stayed at Take 78. The problem went away, with no other config changes. Here is the summary table to correlate: