dphonovation
Collaborator

New Take 78 seems to have broken Management traffic to cluster members in a different subnet?

Hi all,

I'm having a very weird issue with HA MGMT involving two sites that have both taken the latest Take 78. Bear with me, it's hard to word. I've put a good two hours into trying to describe it and tabulate results, and even pulled a colleague in.

Please see the network diagram of the CP MGMT setup:

[Network diagram]

The primary site and primary MGMT server are at Site 1; its MGMT and ClusterXL member subnet is 10.10.171.0/24.
The secondary site and secondary MGMT server are at Site 2; its MGMT and ClusterXL member subnet is 10.20.171.0/24.

The two sites share VLAN 173, which I use to route between them.

Static route at site 1:
S 10.20.171.0/24 via 172.30.0.4, eth4, cost 0, age 13639

Static route at site 2:
S 10.10.171.0/24 via 172.30.0.1, eth4, cost 0, age 837
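(For reference, a route like the above would typically be added in Gaia clish along these lines; treat the exact syntax as an assumption for your Gaia version, and adjust the prefix/next hop per site:)

    set static-route 10.20.171.0/24 nexthop gateway address 172.30.0.4 on
    save config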


Each cluster has a VIP on VLAN 173 (eth4 on all cluster members):

Site 1: 172.30.0.1
Site 2: 172.30.0.4

I've been using Site 1 to manage both sites since install. All green: pushing policies, syncing MGMT servers, etc. Both MGMT servers use their respective local cluster VIP as their gateway.

As a sanity check, I like to run "curl_cli -k https://some_cp_member" to verify that members or MGMT servers can reach each other and that a TLS connection succeeds.
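(A minimal sketch of that sanity check, looping over the far-site members from Site 1; it assumes curl_cli accepts standard curl options such as -m and -w:)

    # From the MGMT server, expert mode; 5-second cap per attempt
    for ip in 10.20.171.2 10.20.171.3 10.20.171.5; do
        echo "--- $ip ---"
        curl_cli -k -s -m 5 -o /dev/null -w "HTTP %{http_code}\n" https://$ip || echo "FAILED"
    done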

 

Today, I pushed Take 78 to both clusters. Since then I'm observing the following:

* The MGMT server at each site can only reach the active FW cluster member at the opposing site.
        * The opposing MGMT server is always reachable.

* At the same time, FW cluster members local to the MGMT server being tested from can reach both the active and standby FW members at the opposing site (10.x.171.2 and 10.x.171.3 in the table below).
        * So can a Linux test host on the same network as the MGMT server (10.x.171.10 in the table below).

In other words, the problem seems to only be affecting the MGMT server!


Here is a text snippet example of what I was doing, followed by a spreadsheet with results:

* From Primary (Site1) MGMT server when FW1-Site2 is Active:
   * curl_cli -k https://10.20.171.2 (fw1-site2)
   * curl_cli -k https://10.20.171.3 (fw2-site2) DOES NOT WORK!
   * curl_cli -k https://10.20.171.5 (mgmt-site2)

                or flip firewalls at Site 2

* From Primary (Site1) MGMT server when FW2-Site2 is Active:
   * curl_cli -k https://10.20.171.2 (fw1-site2) DOES NOT WORK!
   * curl_cli -k https://10.20.171.3 (fw2-site2)
   * curl_cli -k https://10.20.171.5 (mgmt-site2)

 

 

I did concurrent pcaps on the MGMT server and the cluster member that I'm unable to reach. The FW cluster member receives a TLS Client Hello and responds with a Server Hello. Normal stuff.
However, the MGMT server's pcap shows the Client Hello going out but nothing coming back.
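(Roughly, the two captures were along these lines, e.g. with fw2-site2 as the standby; eth2 as the member's MGMT interface matches the topology described further down, and the 8443 filter matches the note below:)

    # On the Site 1 MGMT server (10.10.171.4): capture the test towards the Site 2 standby
    tcpdump -nni any -w /var/tmp/mgmt-side.pcap host 10.20.171.3 and tcp port 8443

    # On the unreachable member (fw2-site2), on its MGMT interface
    tcpdump -nni eth2 -w /var/tmp/member-side.pcap host 10.10.171.4 and tcp port 8443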


(Note: I'm testing port 8443 simply because the Gaia web UI is hosted there; 443 on the gateways is taken by SSLVPN. I'm using the Gaia web UI as a check that I can connect and negotiate TLS.)

This is super confusing, because if I perform the same curl_cli from the cluster members or a standalone Linux host at the same site as the MGMT server, it works FINE.

As you can see in the screenshot, packets arrive at the destination but response packets are NOT returned to the MGMT server.

They ARE returned if I try the same curl from the MGMT server's gateway (i.e. the cluster members local to it) or another test host in the same network as the MGMT server. ?!?!

I've triple-checked route tables and MTU settings on all interfaces.

Help? Did I just discover a bug?

 

My colleague points out that these items appear in the Take's list of resolved issues, and they "sound" related to me:

PRJ-38820, MBS-14060 (ClusterXL): Local connection from the Management interface on a non-standard port (e.g. 8000) may fail.

PRJ-37883, PMTR-81375 (ClusterXL): Local connection from a Standby member may fail when packets are not fragmented even if the interface MTU is smaller than the packet size.

Since composing this post, I have uninstalled Take 78 from all cluster members (which puts them back at Take 66); the MGMT servers stayed at Take 78. The problem goes away with no other config changes. Here is the summary table to correlate:

[Summary table of curl_cli results]
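(For reference, the revert itself was done through CPUSE; a rough clish sketch, where the package number is box-specific and the exact command wording should be double-checked for your version:)

    show installer packages installed      # note the number of the Take 78 jumbo package
    installer uninstall <package number>   # reverts the member to the previously installed Take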

3 Replies
Chris_Atkinson
Employee

What's the default gateway for the Management servers vs the Linux host?

Traffic to the opposite site's management must traverse the remote firewall based on the current routing?

Did you observe any anti-spoofing drop logs for traffic or anything in "fw ctl zdebug + drop" output at the time?

 

Otherwise if it's a *clear* before & after please take it to TAC for investigation.

 

 

CCSM R77/R80/ELITE
dphonovation
Collaborator

>What's the default gateway for the Management servers vs the Linux host?

Exact same. The MGMT server and test host sit on the same network. They both use a ClusterXL VIP as their gateway on a local VLAN, and it is the only interface they have.
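(For the record, that was just eyeballed from the routing tables on both boxes, e.g.:)

    # On the MGMT server (Gaia):
    clish -c "show route"
    # On the Linux test host:
    ip route show default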

> Traffic to the opposite site's management must traverse the remote firewall based on the current routing?

Correct. They have a shared VLAN (172.30.0.0/28, the cyan line in the diagram). Each member has an IP in that network plus their cluster VIP. Then each member has a route for the opposing site's Check Point MGMT network via the opposing site's VIP.

> Did you observe any anti-spoofing drop logs for traffic or anything in "fw ctl zdebug + drop" output at the time?

No to anti-spoofing; I also have it turned off on all interfaces at the moment. Unfortunately, I forgot about "fw ctl zdebug + drop" before I went for the revert.
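(For next time, the kind of drop debug worth capturing on the receiving site's active member while repeating the curl_cli test; the grep on the standby member's IP is just to narrow the output:)

    fw ctl zdebug + drop | grep 10.10.171.3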

> Otherwise if it's a *clear* before & after please take it to TAC for investigation.

Already open, just waiting on a callback.

dphonovation
Collaborator

Sort of a semi-solution: I've had to move Cluster Sync in topology from a dedicated network to the same eth pair that MGMT for all members sits on. Here goes:

From Site 1's perspective (the receiving end of a test curl_cli) there are three LANs involved, and each member has a leg in each:

  • eth2 MGMT - 10.10.171.0/24 (FW1=.2, FW2=.3, VIP=.1) [This is set to Cluster only]
  • eth3 ClusterXL Sync - 172.17.18.0/29 (FW1=.2, FW2=.3, no VIP) [This is set to Sync only]
  • eth4 Inter-Routing - 172.30.0.0/28 (FW1=.2, FW2=.3, VIP=.1) [This is set to Cluster only, and is the one VLAN the sites share to route between each other]

From Site 2's perspective (the initiating end of a test curl_cli) it is similar:

  • eth2 MGMT - 10.20.171.0/24 (FW1=.2, FW2=.3, VIP=.1) [This is set to Cluster only]
  • eth3 ClusterXL Sync - 172.17.18.8/29 (FW1=.9, FW2=.10, no VIP) [This is set to Sync only]
  • eth4 Inter-Routing - 172.30.0.0/28 (FW1=.5, FW2=.6, VIP=.4) [This is set to Cluster only, and is the one VLAN the sites share to route between each other]

Site 1 routing table on the active gateway:
Destination Gateway Genmask Flags Metric Ref Use Iface
10.10.171.0 0.0.0.0 255.255.255.0 U 0 0 0 eth2
10.20.171.0 172.30.0.4 255.255.255.0 UG 0 0 0 eth4
172.17.18.0 0.0.0.0 255.255.255.248 U 0 0 0 eth3
172.30.0.0 0.0.0.0 255.255.255.240 U 0 0 0 eth4
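(For what it's worth, the plain kernel forwarding decision for the problem flow can be checked on that active gateway as below; based on the table above it should point at eth2, which is what makes the capture described next so odd:)

    # Which route would be used for 10.20.171.5 -> 10.10.171.3 arriving on eth4?
    ip route get 10.10.171.3 from 10.20.171.5 iif eth4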

Despite this, for traffic coming from the MGMT server at the opposing site to the standby member (from 10.20.171.5 to 10.10.171.3), the only place I could find the traffic via tcpdumps on the active gateway at Site 1 was on eth3!

eth3 is the interface dedicated to Cluster Sync. You would only expect this traffic on eth4 (the inter-routing VLAN) and eth2 (the MGMT network); eth3 has nothing to do with this.
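(Roughly, the per-interface check was along these lines; the capture filter just matches the MGMT-server-to-standby flow while the curl_cli test repeats:)

    # Run on the Site 1 active member
    tcpdump -c 5 -nni eth2 host 10.20.171.5 and host 10.10.171.3
    tcpdump -c 5 -nni eth4 host 10.20.171.5 and host 10.10.171.3
    tcpdump -c 5 -nni eth3 host 10.20.171.5 and host 10.10.171.3   # only this one saw the packets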

I then set eth3 to "private" in topology, thereby turning off ClusterXL Sync on it, and moved Sync to the MGMT interface by making eth2 a "Cluster + Sync" type. The problem goes away: the test curl_cli works and the cluster is green from both MGMT servers.

 

If you flip MGMT servers, the same thing occurs (i.e. from 10.10.171.4 to 10.20.171.3) until the same topology workaround (Cluster + Sync on the MGMT interface) is applied there as well.
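(After each topology change I verify along these lines; these are standard ClusterXL checks, though the exact output naming varies a little by version:)

    cphaprob state     # both members healthy, one Active / one Standby
    cphaprob -a if     # shows which interfaces are currently Cluster, Sync, or Cluster+Sync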

 

Notes of actions taken:

.. eth3 on the active member at Site 1 (the sole dedicated ClusterXL Sync interface) was seeing this traffic? [why, when this network is dedicated Sync on a different subnet?]
  .. on cp-cluster-site1
    .. turn off Sync on the eth3 interfaces completely (change to Private)
    .. change eth2 (the MGMT interface, currently a Cluster-only type) to Cluster + Sync
    .. push policy
    .. restart the routing daemon with "tellpm process:routed;sleep 5;tellpm process:routed t", otherwise routed is left listening on the wrong interface (check with "netstat -pan | grep routed"); the sequence is consolidated below

  .. traffic no longer flowing through eth3, but eth2
    .. the opposing MGMT server can now reach the standby member!!!! (SmartConsole and curl_cli) (this is what we want)

  .. change eth3 back to Sync only
    .. push policy / ensure the cluster is healthy ("cphaprob state" and all green in SmartConsole)
    .. broken again?
  .. theoretically, in this state rolling back to Take 66 should bring it back to life, just like doing so before (haven't tested yet)
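(Consolidating the routed restart from the notes above into one sequence; the netstat line is just how I confirmed which sockets routed came back up on:)

    # Bounce routed so it re-binds after the topology change
    tellpm process:routed; sleep 5; tellpm process:routed t

    # Confirm routed is listening where expected
    netstat -pan | grep routed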

 

TL;DR:

After updating to Take 78 I had to change the topology to:

  • eth2 MGMT - 10.10.171.0/24 (FW1=.2, FW2=.3, VIP=.1) [This is set to Cluster+Sync]
  • eth3 ClusterXL Sync - 172.17.18.0/29 (FW1=.2, FW2=.3, no VIP) [SET TO PRIVATE OR DELETE]
  • eth4 Inter-Routing - 172.30.0.0/28 (FW1=.2, FW2=.3, VIP=.1) [This is set to Cluster only, and is the one VLAN the sites share to route between each other]

 


It appears that, since the destination IP was another firewall member, at least one of the gateways was for some reason choosing to send this traffic through eth3 (172.17.18.0/29), where the network type is "trusted" (as seen in SmartConsole) because it is a ClusterXL Sync-only network. This despite the fact that the packet was src 10.20.171.5/24 and dst 10.10.171.3/24, and the same firewall has a leg in 10.10.171.0/24 on eth2 (a direct, on-net route).

In other words, it seems to be preferring the trusted ClusterXL Sync network on eth3 (172.17.18.0/29) to reach the standby member, sending the traffic out the wrong interface rather than the dedicated MGMT network, despite what the routing table says. Hope that makes sense.

This leads me to believe there is some kind of smart handover of packets between cluster members for processing, and that it prefers to do this over the known "trusted" Sync network while ignoring the routing table altogether? And this behavior must have changed somehow between Take 66 and Take 78? I just see no reason these packets should otherwise be on eth3.

Prior to this discovery, the conditions I could list seem to be:

If these three conditions are true:

  1. the firewall member is on Take 78
  2. the firewall member is in standby
  3. the firewall member is being managed cross-subnet

... the MGMT server cannot reach the member.

Perhaps there is a fourth: the firewall member must be in a cluster with a dedicated Sync interface (not being used for MGMT)?

 



I can stay like this for a short time in order to have a healthy/manageable cluster, but I thought it was best practice to have Cluster Sync on a dedicated network, and I'd still prefer to use that option.


Still have a TAC case open on this. Same update going to them.

