Oh boy, where do I even start with this one. Let's see if anyone here can top this story; it happened many years ago but just about made me lose my mind.
I was asked to take a look at a problem for a midsize Check Point customer. Apparently a certain file was being blocked by the firewall with no drop log, and no one could figure out why. I agreed to take a remote look thinking it would be an easy fix; little did I know that it would stretch into many long days, culminating in threats to rip out all of their Check Point products.
So when Customer Windows System A would try to transfer a certain file using cleartext FTP through the firewall to Customer Windows System B, the transfer was stalling at about 5 MB and never recovering, always in the exact same place in the file. All other files could be transferred just fine between the two systems and there were no performance issues, and by this time the problematic file had been dubbed the "poison file". The file would also be blocked if it was sent to a completely different destination "System C" through the firewall. Sending the poison file between systems located on the same VLAN/subnet as System A worked fine, so the finger was being pointed squarely at the firewall. The customer was only using the Firewall and IPSec VPN blades on this particular firewall, so no Threat Prevention or APCL/URLF. Matching speed and duplex were verified on all network interfaces and switchports.
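For reference, the speed/duplex verification on the Gaia/Linux side is a quick ethtool check; eth1 below is just a placeholder interface name, not the customer's actual interface:

    ethtool eth1    # shows negotiated Speed, Duplex, Auto-negotiation and Link detected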
So I log in to the Check Point firewall and run fw ctl zdebug drop. Transfer of the poison file commences and stalls as expected. Nothing in the fw ctl zdebug drop output relates to the connection at all, so it is not being dropped/blocked by the Check Point code. Next I disable SecureXL for System A's IP address and start an fw monitor -e. I see all packets of the poison file successfully traversing the firewall, then everything associated with that connection suddenly stops at the time of stall. No retransmissions, FIN/RST or anything at all; it just stops with no further packets. I see exactly the same thing with tcpdump, so it isn't SecureXL, which is disabled anyway. I pull all packets up to that point from both the fw monitor and tcpdump captures into Wireshark, and everything is perfectly fine with the TCP window, ack/seq numbers and everything else; Wireshark doesn't flag anything suspicious. The connection's packets just stop...
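For anyone who wants to follow along, the firewall-side triage looked roughly like the sketch below. The IP address and interface name are made up for illustration, and while I disabled SecureXL just for System A's address at the time, the blunter fwaccel off shown here accomplishes the same thing for a short test:

    fw ctl zdebug drop                                        # live kernel drop messages; nothing for this connection
    fwaccel off                                               # temporarily disable SecureXL acceleration
    fw monitor -e "accept src=10.1.1.10 or dst=10.1.1.10;"    # capture at the firewall's inspection points
    tcpdump -nni eth1 -w poison.pcap host 10.1.1.10           # OS-level capture for comparison in Wireshark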
Theorize that perhaps System A is dropping the "poison" packet due to some kind of local firewall/antivirus software and never transmitting it; the customer makes sure all of that is disabled, with no effect. Next I request to install Wireshark on System A, as there was nowhere else to easily capture packets between System A and the firewall. At this point the grumbling starts from the customer's network team, who are not very fond of Check Point and would like nothing more than to see every Check Point firewall replaced with Cisco. System A's administrator was not very happy about it either, but eventually agrees. We start a Wireshark capture on System A along with an fw monitor -e on the firewall. Poison file transmission starts and then stalls as expected. I notice in the local Wireshark capture that at the time of stall, System A starts retransmitting the same packet over and over again until the connection eventually dies. Thing is, I'm not seeing this "poison packet" nor any of its retransmissions at all in fw monitor or tcpdump; all I see are the packets leading up to it. So as an example, System A is retransmitting TCP SEQ number 33 over and over again, but all I see are the sequence numbers up through 32 in the firewall's captures. The customer's network team certainly likes this finding, as System A has now been verified as properly sending the poison packet toward the firewall.
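If you ever need to spot this pattern yourself, Wireshark's TCP analysis flags the stuck segment immediately; here is a rough tshark equivalent, with a made-up capture file name and the example sequence number from above:

    tshark -r systemA.pcapng -Y "tcp.analysis.retransmission"    # list every retransmitted segment
    tshark -r systemA.pcapng -Y "tcp.seq == 33"                  # isolate the "poison" segment by (relative) SEQ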
This eventually leads to a conversation where I ask exactly what is physically sitting between System A and the firewall's interface. Just a single Cisco Layer 2 switch and nothing else, they tell me. I ask them to manually verify this by tracing cables (cue more grumbling), as I suspect there may be some kind of IPS or other device that doesn't like the poison packet or its retransmissions. They verify the path and only the Layer 2 Cisco switch is there. They try changing switchports for the firewall interface and System A, with no effect. I personally inspect the configuration of the switch and it is indeed operating in pure Layer 2 mode, with no reported STP events or anything else configured that could drop traffic this specifically.
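The switch-side sanity check was nothing exotic, just the standard IOS commands along these lines (port name is hypothetical):

    show interfaces GigabitEthernet0/1    ! per-port counters, including input errors and CRC
    show spanning-tree                    ! no topology changes or blocked ports
    show running-config                   ! pure Layer 2, no ACLs, no policing, nothing fancy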
The next step is to try replacing the switch. The customer wants to replace it with a switch of the same model, but I insist that during a downtime window we use an unmanaged switch that is as stupid as possible, cabling System A and the firewall directly through it. This requires coordination of a downtime window, and as a result I'm hearing that this issue is now being escalated inside the customer's organization beyond the Director level, and VPs are starting to get involved. Thinly-veiled threats to pull out this Check Point firewall and the 10 or so others they have are starting to percolate. We get the downtime window approved and hook System A and the firewall together using a piece-of-crap $30 switch from Best Buy. We start the transfer and...and...
it still stalls at exactly the same place; everything still looks exactly the same in all the Wireshark and firewall packet captures. The customer's network team is getting pretty smug now, thinking they have won and that they'll soon be using Cisco firewalls. At this point I'm on the brink of an existential crisis and starting to seriously question my knowledge and experience. Humbling, to say the least.
Finally in desperation this conversation ensues:
Me: What is physically between System A and the firewall?
Them: We already told you, the Cisco switch which is clearly working fine.
Me: How far apart physically are the involved systems in your Data Center? (thinking electrical interference)
Them: Not sure how that is relevant but System A is about 10 feet from the switch and direct wired, and the firewall is on the other side of the Data Center about 80 feet away.
Me: System A is direct wired to the switch with a single cable?
Them: (Exasperated) Yes.
Me: And the firewall is direct wired to the switch with a long Ethernet cable run across the room?
Them: Er, no it is too far for that.
Me: Wait, what?
Them: There is an RJ45 patch panel in the same rack as the switch which leads to a patch panel on the other side of the room near the firewall.
Me: And you are sure there is not some kind of IPS or other device between the ports of the patch panels?
Them: Dude come on, it is just a big bundle of wire going across the room. The firewall is direct wired to the patch panel on that side.
Me: What is the longest direct Ethernet cable you have on hand?
Them: What?
Me: What is the longest direct Ethernet cable you have on hand?
Them: Dunno, probably 100 feet or so.
Me: I want you to direct wire the firewall to the switch itself, bypassing the patch panel. Throw the long cable on the floor for now.
Them: Come on man, that won't have any effect.
Me: Do it anyway please.
Them: I'm calling my manager, this is ridiculous.
(click)
So after some more teeth-gnashing and approvals we get a downtime window; the customer has already reached out to Cisco for a meeting at this point and things are looking grim. The cable is temporarily run across the floor of the data center, directly connecting the switch and firewall. From the firewall I see the interface go down after being unplugged from the patch panel and come back up with the direct-wired connection. I carefully check the speed and duplex on both sides to ensure there is no mismatch. Watching on a shared screen with a phone audio conference active, the transfer of the poison file starts....
BOOM. The poison file transfers successfully in a matter of seconds.
An amazing litany of foul language spews from my phone's speaker, combined in creative ways that would make even a sailor blush. Eventually someone manages to slam the mute button on the other end. After they come back off mute, they incredulously ask what I changed. Nothing. Nothing at all. They don't believe me, so I encourage them to hook everything back up through the same patch panel ports as before. They do so and launch the transfer. Immediate stall at 5 MB again; they are much faster on the mute button this time. They move both sides to a different patch panel jack, and the poison file transfer succeeds no problem. Move it back to the original patch panel ports, and the poison file stall at 5 MB is back.
At this point, for the first time, I check the firewall's network interface counters with netstat -ni, which I hadn't really looked at due to the specificity of the problem. There are numerous RX-ERRs being logged, and ethtool -S shows that they are CRC errors, which start actively incrementing every time a stall happens. Dammit.
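For anyone adding this to their own checklist, the checks in question look something like this; eth1 is a placeholder and the exact ethtool statistic names vary by NIC driver:

    netstat -ni                                     # the RX-ERR column should be zero and stay zero
    ethtool -S eth1 | grep -iE 'err|crc'            # driver statistics, e.g. rx_crc_errors on many drivers
    watch -n 1 "ethtool -S eth1 | grep -i crc"      # watch the counter climb in real time during a stall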
So the best I can figure is that there was a bad punch-down in the RJ45 patch panel, and when a very specific bit sequence happened to be sent (a sequence contained within the poison file) a bit would get flipped, which then caused the frame's CRC checksum (the Ethernet FCS) to fail verification on the firewall. The packet would be retransmitted, the bit flip would happen again, and it would be discarded again, over and over. What I did not know at the time is that when a frame has a CRC error like this, it is discarded by the firewall's *NIC hardware* itself and never makes it anywhere near fw monitor or tcpdump, which is why I couldn't see it. There is no way to change this behavior on most NICs either. I never tried to figure out what the poison bit pattern was that caused the bit flip, but I wouldn't be surprised if it consisted of the number 6 in groups of three or something. Jeez.
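As an aside, a few NIC drivers can be told to pass errored frames up the stack, which would have made the poison packet visible to tcpdump; most cannot, which matches what I said above. Purely as a hedged example of what to check (eth1 again a placeholder, and many drivers will report these features as [fixed]):

    ethtool -k eth1 | grep -E 'rx-all|rx-fcs'    # can this driver deliver errored frames / keep the FCS?
    ethtool -K eth1 rx-all on rx-fcs on          # only works on the minority of drivers that support it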
So the big takeaway: CHECK YOUR BLOODY NETWORK COUNTERS.
Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com