Ed_Eades
Contributor

DMZ Default Gateway Intermittent Connection Issue

I am looking for some community help on an issue that we have been troubleshooting and trying to piece together for weeks. I apologize up front for the lengthy post, but I want to provide as many details as possible on the issue we have been facing and some of the things we have done to troubleshoot it. We are coming to the community in the hope that someone has seen our problem before or has some additional ideas about what could be going on.

We appear to be facing a very random, intermittent issue where devices lose connectivity to the default gateway on our dmz interface. Sometimes this affects only a few DMZ devices and is almost like a hiccup; other times it affects many more DMZ devices and can be about a 30 second interruption. This is happening with 2 different HA clusters in 2 separate data centers (primary and secondary). At least 95% of Internet traffic is handled by the primary data center daily, yet the issue does seem to happen with more frequency at the secondary data center. We are a health care provider, so 24x7 availability is essential. The issue seldom occurs during evening/overnight hours, though it does sometimes. It seems to happen more often during core working hours, but it may occur only once or several times in a given day.

Our topology is an HA cluster with 2 gateways at each data center. The inside and outside interfaces use 10Gb interfaces, and the dmz interface uses a bond of five 1Gb interfaces. The gateways plug into Cisco switches, although they are different Cisco models and platforms at each data center. The bond is set up for Layer3+4 hashing and Slow LACP on the CheckPoint side, and the Cisco side uses src-dst-ip as the etherchannel load balancing method. Traffic seems to be shared across each member interface equally, and we have had both CheckPoint and Cisco TAC review the bond/port channel setups.
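
For reference, a Cisco-side configuration matching this setup would look roughly like the sketch below (member interface names and numbers are placeholders, not our exact config; slow LACP on the CheckPoint side corresponds to the default LACP rate on the Cisco members):

! Sketch of the Cisco side of the DMZ bond - interface names/numbers are placeholders
port-channel load-balance src-dst-ip
!
interface range GigabitEthernet1/1 - 5
 description FW DMZ bond member
 channel-group 6 mode active
 ! default (normal) LACP rate here matches "slow" LACP on the CheckPoint side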

Troubleshooting has led us to determine that only the default gateway of the dmz interface loses connectivity. The default gateway is a virtual IP through the HA cluster. During a time when the default gateway is unreachable, the dmz physical interface IPs are still reachable. We have some icmp monitors to the dmz default gateway set up, sourcing from some devices in the dmz, and at the time the issue occurs the request packets are not received on the gateways. The icmp monitors to the dmz interface IPs do not fail during the issue, and packet captures show all of those requests being received.
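
A minimal sketch of how such a capture can be taken on the active member (the bond name and IP addresses below are placeholders for our DMZ VIP, a physical interface IP, and the monitoring hosts):

# Capture echo requests to the cluster VIP vs. the physical interface IP
tcpdump -nni bond1 'icmp and host 10.20.30.1'    # DMZ default gateway (cluster VIP)
tcpdump -nni bond1 'icmp and host 10.20.30.2'    # DMZ physical interface IP of this member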

We have been taking packet captures from different connection points and reviewing other areas of the gateways and switching infrastructure for possible answers. We tried adding static arp entries on the gateways for an example device that loses connection to the dmz default gateway, but that did not change the behavior. We are running R80.40 and also recently installed the latest Jumbo, Take 180, to see if it would help. The bond interfaces do show some continuously incrementing drop counters, but there are not any drops showing on the physical interfaces. The bond drop counters increment consistently and do not correlate to only when the issue is occurring. There are no output drops being registered on the Cisco side. We are at a point where we are just not sure what could be causing the dmz default gateway to basically disappear briefly, at very random intermittent times, though somewhat more often during core traffic times. Some of the dmz devices that lose this connection are on the same switch as the gateways, so that traffic is essentially contained within one switch.
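
For completeness, the bond and per-interface counters referenced above are checked roughly like this on the gateways (bond and interface names are placeholders):

# Bond status and per-interface drop counters
cat /proc/net/bonding/bond1    # LACP partner state, active slaves
netstat -i                     # RX-DRP / RX-ERR per bond and member interface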

It seems odd that only this dmz interface is affected and not the other interfaces (inside and outside) on the gateways, which are also set up in the HA cluster. It is also troubling that it is occurring on 2 separate HA clusters located in 2 separate data centers.

I could provide a topology diagram if helpful.

Many Thanks.

Chris_Atkinson
Employee

Any other info you can share... are there load-balancers involved in your DMZ, and/or do you use VMAC mode for the cluster, etc.?

Do you see any messages from the cluster indicating failovers?

Assuming you are running Vlans on the DMZ interface, are the lowest & highest ID ones on that trunk healthy throughout the switch topology?

CCSM R77/R80/ELITE
Ed_Eades
Contributor

We have F5 load-balancers in our DMZ but the CP gateways do not utilize them.  We do have VMAC for the cluster interfaces and CCP mode is Manual (Unicast) on the clusters at each data center.

There are not any cluster failovers occurring at the times of these issues. Would there be any cluster logs to check for messages?
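
(For reference, a minimal sketch of where failover evidence would typically show up, assuming expert-mode shell access on each member:)

cphaprob state                      # current HA state of both members
cphaprob -a if                      # monitored cluster interfaces and their status
grep -i cluster /var/log/messages   # ClusterXL state-change messages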

In the primary data center the DMZ interface does not have a vlan, just an IP address assigned to the interface. The secondary data center DMZ interface has 2 vlans, and only one of the vlans has this issue. The other vlan never loses connection to its default gateway, including at the times the affected vlan does.

Chris_Atkinson
Employee

Any spanning-tree related events in the switch logs / outputs?  How are the ports connected to the firewall configured in terms of their portfast settings and similar...


Also note for awareness:

sk83420: Traffic issues during Check Point ClusterXL failover in F5 BIG IP environment

CCSM R77/R80/ELITE
CE_SE
Employee Alumnus

Two different DCs and clusters... I think I would jump all the way back to the beginning to troubleshoot this. When did it start happening (gather the exact date if possible)? Health care... I'm assuming you follow ITIL; what changes within the expected time frame took place?

For me, what you describe sounds like duplicate IPs or legacy spanning-tree. Sounds like you already thought about ARP/MAC tables; what happens on the FW/switch ARP/MAC tables during an event? If you run STP, is there a stable root bridge? Are TCNs high (topology change notifications for that vlan)?

Possible to offload the L3 interfaces to the switch?


Ed_Eades
Contributor

We do follow ITIL and have change control policies, but we feel this issue may have been present for some time and only recently become apparent. These DMZs have been in place for many, many years, and there haven't been too many devices using the dmz as their default gateway. These dmz networks had more F5 VIPs using this IP space than individual devices or servers. Within the last year some more servers have been placed in these networks, and they have health monitors in place that seem to have exposed this issue. So there is not going to be an expected time frame of when this would have started.

We have looked into spanning-tree but haven't seen anything jump out, though we could also be missing something to look for; any recommendations? These dmz port channels do not show a lot of topology changes, and the last change that occurred never corresponds to when the issue occurs. sk115963 mentions the rx-errors counters increasing every 2 seconds, which actually sounds similar to what we notice on these interfaces. Maybe that sk is relevant to the counters we see increase every few seconds, but we are not sure if it is related to the issue we are having.

Not sure if we would be able to offload the L3 interfaces to the switches. The gateways connect to L2 switches.

primary DC

Under global config:
spanning-tree mode mst
spanning-tree portfast edge default
spanning-tree portfast edge bpduguard default
spanning-tree portfast edge bpdufilter default
spanning-tree extend system-id

interface Port-channel6
description FW DMZ
switchport
switchport access vlan 125
switchport mode access
spanning-tree portfast edge


secondary DC

Under global Config:
spanning-tree mode mst
spanning-tree portfast default
spanning-tree portfast bpduguard default
spanning-tree extend system-id

interface Port-channel4
description FW DMZ
switchport trunk allowed vlan 217,218
switchport mode trunk

Chris_Atkinson
Employee

Port-channel4 is without "portfast trunk" or is the config just truncated?


Possibly platform specific with respect to defaults, but it's not obvious without other outputs, e.g.


show spanning-tree detail | begin Port-channel 4
show spanning-tree detail | begin VLAN0217
show spanning-tree detail | begin VLAN0218

CCSM R77/R80/ELITE
Ed_Eades
Contributor

Nothing has been truncated in the port-channel4 config provided.  I showed the global spanning tree config and the port-channel4 config.


show spanning-tree detail | begin Port-channel4

Port 2380 (Port-channel4) of MST1 is designated forwarding
Port path cost 4000, Port priority 128, Port Identifier 128.2380.
Designated root has priority 1, address xxxx.xxxx.bfc2
Designated bridge has priority 32769, address xxxx.xxxx.2e80
Designated port id is 128.2380, designated path cost 1200
Timers: message age 0, forward delay 0, hold 0
Number of transitions to forwarding state: 1
Link type is point-to-point by default, Internal
BPDU: sent 797507, received 0


show spanning-tree vlan 217 detail

MST1 is executing the mstp compatible Spanning Tree protocol
Bridge Identifier has priority 32768, sysid 1, address xxxx.xxxx.2e80
Configured hello time 2, max age 20, forward delay 15, transmit hold-count 6
Current root has priority 1, address xxxx.xxxx.bfc2
Root port is 2377 (Port-channel1), cost of root path is 1200
Topology change flag not set, detected flag not set
Number of topology changes 3363 last change occurred 2w2d ago
from Port-channel1
Times: hold 1, topology change 35, notification 2
hello 2, max age 20, forward delay 15
Timers: hello 0, topology change 0, notification 0


show spanning-tree vlan 218 detail

MST1 is executing the mstp compatible Spanning Tree protocol
Bridge Identifier has priority 32768, sysid 1, address xxxx.xxxx.2e80
Configured hello time 2, max age 20, forward delay 15, transmit hold-count 6
Current root has priority 1, address xxxx.xxxx.bfc2
Root port is 2377 (Port-channel1), cost of root path is 1200
Topology change flag not set, detected flag not set
Number of topology changes 3363 last change occurred 2w2d ago
from Port-channel1
Times: hold 1, topology change 35, notification 2
hello 2, max age 20, forward delay 15
Timers: hello 0, topology change 0, notification 0

Chris_Atkinson
Employee

I can't say it will solve this specific issue but since Po4 connects to an end host (L3) it would benefit from "spanning-tree portfast trunk".

At a minimum it will generally help the port transition to a forwarding state faster since it isn't a path to another L2 device with risk of a loop and can skip some steps in that transition.
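
A sketch of the suggested change (the exact keyword varies by platform; some IOS versions use "spanning-tree portfast edge trunk" instead):

interface Port-channel4
 spanning-tree portfast trunk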

CCSM R77/R80/ELITE
Ed_Eades
Contributor

The spanning-tree portfast trunk not being on Po4 was a missed configuration, and it was added earlier this evening. Good catch. Since we see the same issue (although not nearly as often) at the other data center, where portfast is already on the trunk port, this may not be the fix for the specific problem.

After further review today of some past packet captures, the issue seems to point to the dmz vmacs, but we are just not quite sure what the issue may be. It seems that for the devices inside the dmz that send traffic to internal, the capture shows the destination mac is the vmac even though it was sourced from the physical mac. Then the return traffic shows the source mac as the physical mac back to the sender. The devices showing this behavior in the packet captures are the ones impacted by the issue.

However, when traffic has the physical mac as the source mac, we see the return traffic coming back to the physical mac, and this traffic is not impacted when the issue occurs. We still don't know what this may mean or what is causing the issue, but we seem to have narrowed it down to something vmac related.
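
A sketch of how the two flows can be separated in a capture (the MAC values shown are placeholders; the 00:1c:7f prefix is the CheckPoint VMAC prefix mentioned later in this thread):

# Link-level capture split by the MAC in use
tcpdump -nnei bond1 'ether host 00:1c:7f:00:aa:bb'   # frames using the cluster VMAC
tcpdump -nnei bond1 'ether host 00:11:22:33:44:55'   # frames using the member's physical MAC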

Hoping this information may be useful and could lead to other suggestions of where to look next.

Timothy_Hall
Legend

The "rolling outage" behavior just screams firewall ARP cache overflow, but that shouldn't manifest itself on only the DMZ interface. Any chance there could be more than 4096 unique MAC addresses directly adjacent to the cluster on all interfaces? How many entries are in the ARP cache of the gateway?  arp -an | wc -l
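
A quick way to compare that count against the kernel neighbor-table limits (a sketch; these are standard Linux sysctls, with gc_thresh3 acting as the hard cap):

arp -an | wc -l                             # current ARP/neighbor entries
sysctl net.ipv4.neigh.default.gc_thresh1    # soft garbage-collection threshold
sysctl net.ipv4.neigh.default.gc_thresh2    # pressure threshold
sysctl net.ipv4.neigh.default.gc_thresh3    # hard maximum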

Another guess is that there is some device that is occasionally answering ARP requests for the cluster IP with its own (wrong) MAC. I'd be suspicious of the F5s frankly. What happens if you hardcode the default gateway IP to the VMAC on several DMZ servers (not on the firewall)? Are these servers now immune to the issue? Could the F5s or another DMZ device somehow have a static proxy ARP for the firewall's default gateway IP, when it should have been added as a static non-proxy ARP mapping instead? Try taking a packet capture matching only ARP traffic on one of your DMZ servers, then repeatedly delete the ARP cache entry for the firewall on that server to force it to ask again over and over; you'll have to take the capture on the server itself in order to see the rogue reply. Is anything answering that ARP that shouldn't be? When the firewall is busier during the day it may not always win the race against the rogue to send a reply first.
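
A rough sketch of that test run from one of the DMZ servers (assuming a Linux host; the interface name and gateway IP are placeholders):

# Watch every ARP reply for the gateway IP while repeatedly forcing a fresh lookup
tcpdump -nnei eth0 'arp and host 10.20.30.1' &
while true; do
  ip neigh del 10.20.30.1 dev eth0        # drop the cached entry for the gateway
  ping -c 1 -W 1 10.20.30.1 >/dev/null    # trigger a new ARP request
  sleep 2
done
# Any reply whose sender MAC is not the cluster VMAC identifies the rogue device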

Last thing to try is powering off the standby and letting it be a cluster of one for a while. Does the problem disappear? If so, it could be something in ClusterXL malfunctioning (less likely) or some kind of strange interaction between ClusterXL and your switching architecture (more likely).

Don't worry about the RX-DRPs as long as there are no corresponding fifo/miss events in the output of ethtool -S; this is just unknown protocol traffic being received and probably not related to your problem.
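
i.e., something like this per bond member (interface name is a placeholder):

ethtool -S eth2 | grep -iE 'fifo|miss'   # if these stay flat while RX-DRP climbs, the drops are benign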

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Ed_Eades
Contributor

We may have made a discovery: a system seems to be moving the VMAC of the CheckPoint over to its interface. The device it is moving to is a Citrix Netscaler. After further troubleshooting and capture review, we set up the switching infrastructure to log MAC address moves, and we started seeing the VMAC of the CheckPoint moving over to the interface (port-channel) of the Netscalers. This seems to point to the Netscalers acting, at various times, as the default gateway for DMZ traffic.

Has anyone else seen this type of behavior from a Netscaler or another device taking over the mac of the gateway?

Example of log output from network switch.  CheckPoint DMZ on Po6 and Netscaler on Po8

Feb 22 17:41:17: %C4K_EBM-4-HOSTFLAPPING: Host xx:xx:xx:xx:xx:AF in vlan xx5 is moving from port Po6 to port Po8
Feb 22 17:41:23: %C4K_EBM-4-HOSTFLAPPING: Host xx:xx:xx:xx:xx:AF in vlan xx5 is moving from port Po8 to port Po6
Feb 22 18:24:29: %C4K_EBM-4-HOSTFLAPPING: Host xx:xx:xx:xx:xx:AF in vlan xx5 is moving from port Po6 to port Po8
Feb 22 18:24:39: %C4K_EBM-4-HOSTFLAPPING: Host xx:xx:xx:xx:xx:AF in vlan xx5 is moving from port Po8 to port Po6
Feb 22 21:39:49: %C4K_EBM-4-HOSTFLAPPING: Host xx:xx:xx:xx:xx:AF in vlan xx5 is moving from port Po6 to port Po8
Feb 22 21:39:57: %C4K_EBM-4-HOSTFLAPPING: Host xx:xx:xx:xx:xx:AF in vlan xx5 is moving from port Po8 to port Po6
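
(For anyone wanting to set up similar logging: on many Cisco IOS platforms MAC-move notification is enabled with something like the following; exact syntax varies by platform and version, so treat it as a sketch rather than our exact config.)

! Log MAC address moves between ports
mac address-table notification mac-move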

Timothy_Hall
Legend

What you have found aligns perfectly with the behavior you are seeing.

I can't see why a Netscaler or any other device would try to use or take over another system's MAC address, unless you got extremely unlucky and the Netscaler and the cluster happened to dynamically derive the exact same VMAC address for their use.  Here is how the Check Point cluster computes the VMAC to use:

First 24 bits - Unique constant value: 00:1C:7F

Next 8 bits - VSX Virtual System ID:

  • In a VSX Cluster: the Virtual System ID

  • In a non-VSX Cluster: 00000000

Last 16 bits - Unique value for each cluster: a value that the Management Server assigns to each cluster object. This makes the VMAC value unique for each managed cluster.

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Ed_Eades
Contributor

The Netscaler MACs on the port-channels to the Netscalers are definitely different from the CheckPoint vmac. The intermittent nature is also strange. Looking for any further suggestions on how and why this could be happening.

Chris_Atkinson
Employee

Do both the Check Point and Netscalers use bonds to connect to the switching, and has the cabling been verified as correct? Perhaps one or more slaves is mis-cabled to the wrong port-channel...

I also vaguely remember the Netscalers doing something odd with how they forward/route traffic via their management port, but it escapes me for the moment.

CCSM R77/R80/ELITE
Ed_Eades
Contributor

The CheckPoint and Netscaler use bonds at one data center and do not use bonds at the other, and we have the same issue at both, so I would suspect it is more likely the Netscalers doing something odd with how they forward/route traffic.

Gregory_Link
Contributor

Ed,

We are having a somewhat similar situation with laptops on a Citrix VPN that sits in one of our DMZs. There are no issues when the laptops are on the internal network, but when on the Citrix VPN they have all kinds of intermittent slowness issues. Were you ever able to find a solution here?

Ed_Eades
Contributor

Our symptoms do sound different from what you are noting; however, we did find a solution to our issue. The Citrix Gateway was periodically sending traffic with the firewall's MAC address. It turns out this is documented in an article buried deep in the Citrix archives, which we ended up finding through Google searches, and we had to point Citrix support to their own article before it was admitted to be a known issue: CTX281417. Hope this helps.
