Oh boy, where do I even start with this one. Let's see if anyone here can top this story; it happened many years ago but just about made me lose my mind.
I was asked to take a look at a problem for a midsize Check Point customer. Apparently a certain file was being blocked by the firewall with no drop log, and no one could figure out why. I agreed to take a remote look thinking it would be an easy fix; little did I know that it would stretch into many long days, culminating in threats to rip out all of their Check Point products.
So when Customer Windows System A would try to transfer a certain file using cleartext FTP through the firewall to Customer Windows System B, the transfer was stalling at about 5 MB and never recovering, always in the exact same place in the file. All other files could be transferred just fine between the two systems and there were no performance issues, and by this time the problematic file had been dubbed the "poison file". The file would also be blocked if it was sent to a completely different destination "System C" through the firewall. Sending the poison file between systems located on the same VLAN/subnet as System A worked fine, so the finger was being pointed squarely at the firewall. The customer was only using the Firewall and IPSec VPN blades on this particular firewall, so no Threat Prevention or APCL/URLF. Matching speed and duplex were verified on all network interfaces and switchports.
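For reference, the speed/duplex verification on the Gaia/Linux side is a quick ethtool check; eth1 below is just a placeholder interface name, not the customer's actual interface:

    ethtool eth1    # shows negotiated Speed, Duplex, Auto-negotiation and Link detected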
So I log in to the Check Point firewall and run fw ctl zdebug drop. Transfer of the poison file commences and stalls as expected. Nothing in the fw ctl zdebug drop output relates to the connection at all, so it is not being dropped/blocked by the Check Point code. Next I disable SecureXL for System A's IP address and start an fw monitor -e. I see all packets of the poison file successfully traversing the firewall, then everything associated with that connection suddenly stops at the time of stall. No retransmissions, FIN/RST or anything at all; it just stops with no further packets. I see exactly the same thing with tcpdump, so it isn't SecureXL, which is disabled anyway. I pull all packets up to that point from both the fw monitor and tcpdump captures into Wireshark, and everything is perfectly fine with the TCP window, ack/seq numbers and everything else; Wireshark doesn't flag anything suspicious. The connection's packets just stop...
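For anyone who wants to follow along, the firewall-side triage looked roughly like the sketch below. The IP address and interface name are made up for illustration, and while I disabled SecureXL just for System A's address at the time, the blunter fwaccel off shown here accomplishes the same thing for a short test:

    fw ctl zdebug drop                                        # live kernel drop messages; nothing for this connection
    fwaccel off                                               # temporarily disable SecureXL acceleration
    fw monitor -e "accept src=10.1.1.10 or dst=10.1.1.10;"    # capture at the firewall's inspection points
    tcpdump -nni eth1 -w poison.pcap host 10.1.1.10           # OS-level capture for comparison in Wireshark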
Theorize that perhaps System A is dropping the "poison" packet due to some kind of local firewall/antivirus software and never transmitting it; the customer makes sure all of that is disabled, with no effect. Next I request to install Wireshark on System A, as there was nowhere else to easily capture packets between System A and the firewall. At this point the grumbling starts from the customer's network team, who are not very fond of Check Point and would like nothing more than to see every Check Point firewall replaced with Cisco. System A's administrator was not very happy about it either, but eventually agrees. We start a Wireshark capture on System A along with an fw monitor -e on the firewall. Poison file transmission starts and then stalls as expected. I notice in the local Wireshark capture that at the time of stall, System A starts retransmitting the same packet over and over again until the connection eventually dies. Thing is, I'm not seeing this "poison packet" nor any of its retransmissions at all in fw monitor or tcpdump; all I see are the packets leading up to it. So as an example, System A is retransmitting TCP SEQ number 33 over and over again, but all I see are the sequence numbers up through 32 in the firewall's captures. The customer's network team certainly likes this finding, as System A has now been verified as properly sending the poison packet toward the firewall.
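If you ever need to spot this pattern yourself, Wireshark's TCP analysis flags the stuck segment immediately; here is a rough tshark equivalent, with a made-up capture file name and the example sequence number from above:

    tshark -r systemA.pcapng -Y "tcp.analysis.retransmission"    # list every retransmitted segment
    tshark -r systemA.pcapng -Y "tcp.seq == 33"                  # isolate the "poison" segment by (relative) SEQ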
This eventually leads to a conversation where I ask exactly what is physically sitting between System A and the firewall's interface. Just a single Cisco Layer 2 switch and nothing else, they tell me. I ask them to manually verify this by tracing cables (cue more grumbling), as I suspect there may be some kind of IPS or other device that doesn't like the poison packet or its retransmissions. They verify the path and only the Layer 2 Cisco switch is there. They try changing switchports for the firewall interface and System A, with no effect. I personally inspect the configuration of the switch and it is indeed operating in pure Layer 2 mode, with no reported STP events or anything else configured that could drop traffic this specifically.
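The switch-side sanity check was nothing exotic, just the standard IOS commands along these lines (port name is hypothetical):

    show interfaces GigabitEthernet0/1    ! per-port counters, including input errors and CRC
    show spanning-tree                    ! no topology changes or blocked ports
    show running-config                   ! pure Layer 2, no ACLs, no policing, nothing fancy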
The next step is to try replacing the switch. The customer wants to replace it with a switch of the same model, but I insist that during a downtime window we use an unmanaged switch that is as stupid as possible, cabling System A and the firewall directly through it. This requires coordination of a downtime window, and as a result I'm hearing that this issue is now being escalated inside the customer's organization beyond the Director level, and VPs are starting to get involved. Thinly-veiled threats to pull out this Check Point firewall and the 10 or so others they have are starting to percolate. We get the downtime window approved and hook System A and the firewall together using a piece-of-crap $30 switch from Best Buy. We start the transfer and...and...
it still stalls at exactly the same place; everything still looks exactly the same in all the Wireshark and firewall packet captures. The customer's network team is getting pretty smug now, thinking they have won and that they'll soon be using Cisco firewalls. At this point I'm on the brink of an existential crisis and starting to seriously question my knowledge and experience. Humbling, to say the least.
Finally in desperation this conversation ensues:
Me: What is physically between System A and the firewall?
Them: We already told you, the Cisco switch which is clearly working fine.
Me: How far apart physically are the involved systems in your Data Center? (thinking electrical interference)
Them: Not sure how that is relevant but System A is about 10 feet from the switch and direct wired, and the firewall is on the other side of the Data Center about 80 feet away.
Me: System A is direct wired to the switch with a single cable?
Them: (Exasperated) Yes.
Me: And the firewall is direct wired to the switch with a long Ethernet cable run across the room?
Them: Er, no it is too far for that.
Me: Wait, what?
Them: There is an RJ45 patch panel in the same rack as the switch which leads to a patch panel on the other side of the room near the firewall.
Me: And you are sure there is not some kind of IPS or other device between the ports of the patch panels?
Them: Dude come on, it is just a big bundle of wire going across the room. The firewall is direct wired to the patch panel on that side.
Me: What is the longest direct Ethernet cable you have on hand?
Them: What?
Me: What is the longest direct Ethernet cable you have on hand?
Them: Dunno, probably 100 feet or so.
Me: I want you to direct wire the firewall to the switch itself, bypassing the patch panel. Throw the long cable on the floor for now.
Them: Come on man, that won't have any effect.
Me: Do it anyway please.
Them: I'm calling my manager, this is ridiculous.
(click)
So after some more teeth-gnashing and approvals we get a downtime window; the customer has already reached out to Cisco for a meeting at this point and things are looking grim. The cable is temporarily run across the floor of the data center, directly connecting the switch and firewall. From the firewall I see the interface go down after being unplugged from the patch panel and come back up with the direct-wired connection. I carefully check the speed and duplex on both sides to ensure there is no mismatch. Watching on a shared screen with a phone audio conference active, the transfer of the poison file starts....
BOOM. The poison file transfers successfully in a matter of seconds.
An amazing litany of foul language spews from my phone's speaker, combined in creative ways that would make even a sailor blush. Eventually someone manages to slam the mute button on the other end. After they come back off mute, they incredulously ask what I changed. Nothing. Nothing at all. They don't believe me, so I encourage them to hook everything back up through the same patch panel ports as before. They do so and launch the transfer. Immediate stall at 5 MB again; they are much faster on the mute button this time. They move both sides to a different patch panel jack, and the poison file transfer succeeds no problem. Move it back to the original patch panel ports, and the poison file stall at 5 MB is back.
At this point, for the first time, I check the firewall's network interface counters with netstat -ni, which I hadn't really looked at due to the specificity of the problem. There are numerous RX-ERRs being logged, and ethtool -S shows that they are CRC errors, which start actively incrementing every time a stall happens. Dammit.
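For anyone adding this to their own checklist, the checks in question look something like this; eth1 is a placeholder and the exact ethtool statistic names vary by NIC driver:

    netstat -ni                                     # the RX-ERR column should be zero and stay zero
    ethtool -S eth1 | grep -iE 'err|crc'            # driver statistics, e.g. rx_crc_errors on many drivers
    watch -n 1 "ethtool -S eth1 | grep -i crc"      # watch the counter climb in real time during a stall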
So the best I can figure is that there was a bad punch-down in the RJ45 patch panel, and when a very specific bit sequence happened to be sent (a sequence contained within the poison file) a bit would get flipped, which then caused the frame's CRC checksum (the Ethernet FCS) to fail verification on the firewall. The packet would be retransmitted, the bit flip would happen again, and it would be discarded again, over and over. What I did not know at the time is that when a frame has a CRC error like this, it is discarded by the firewall's *NIC hardware* itself and never makes it anywhere near fw monitor or tcpdump, which is why I couldn't see it. There is no way to change this behavior on most NICs either. I never tried to figure out what the poison bit pattern was that caused the bit flip, but I wouldn't be surprised if it consisted of the number 6 in groups of three or something. Jeez.
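As an aside, a few NIC drivers can be told to pass errored frames up the stack, which would have made the poison packet visible to tcpdump; most cannot, which matches what I said above. Purely as a hedged example of what to check (eth1 again a placeholder, and many drivers will report these features as [fixed]):

    ethtool -k eth1 | grep -E 'rx-all|rx-fcs'    # can this driver deliver errored frames / keep the FCS?
    ethtool -K eth1 rx-all on rx-fcs on          # only works on the minority of drivers that support it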
So the big takeaway: CHECK YOUR BLOODY NETWORK COUNTERS.
Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com