Solved: Re: "It's not the firewall" - Page 2

Victor_MR · ‎2021-03-08

Hi (check)mates,

We all know that "the firewall" is one of the first things people blame when there is a traffic issue. A security gateway (a "firewall") does a lot of "intelligent stuff" beyond just routing traffic (and, in fact, many network devices today also do "a lot of things"), so I understand there is good reason to suspect the firewall. At the same time, there are a great many cases where the issue has nothing to do with it, or is not directly related to it.

I'm looking to build a brief list of typical or fairly frequent issues we face where "the firewall" is reported as the root cause but ultimately isn't.

It's quite a generic topic, and in terms of troubleshooting it's probably even more generic. There are several simple tools one should probably use first, such as traffic logs, fw monitor, tcpdump/cppcap, etcetera. But what I would like to focus on is not the troubleshooting but the issues themselves, assuming of course that the firewall side is properly configured (a bad configuration would be a genuine "firewall issue").

To narrow the scope, I'm focusing especially on networking issues, but every idea is welcome.

Do you think it would be useful to put together such a list? 🙂 What issues do you usually find?

Something to start

(I'll update this list with new suggested issues):

A multicast issue with the switches, impacting the cluster behavior.
A VLAN is not propagated to all the switches involved in the cluster communication, especially in VSX environments where not all the VLANs are monitored by default.
Related to remote access VPN (this year has been quite active in that matter), some device at the WAN side is blocking the ISAKMP UDP 4500 packets directed to our Gateways, but not the whole UDP 4500. Typically, another firewall 🤣
Asymmetric routing issues, where the traffic goes through one member and comes back through the other member of the cluster.
Static ARP entries in the "neighbor" routing devices, or ARP cache issues.
Any kind of issue with Internet access: DNS queries not allowed to the Internet or to the corporate DNS servers (so we cannot resolve our public domains), TCP ports blocked, or a required URL blocked (typically by a proxy)...
Traffic delays: these are typically more difficult to diagnose. fw monitor with timestamps is one of our friends here.
Layer 1 (physical) issues. Don't forget to review the hardware interface counters!
Missing route at the destination, especially routes for the encryption domains in a VPN.
Why not: another firewall blocking the communication, of course 😊 Or a forgotten transparent layer-7 device in the middle (like an IPS), installed in a previous age. This may be a variant: "it's not my firewall"
An application or server issue. The simplest example is that the server is not listening on the requested port. A more complex one would be an application layer issue.
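For the last item, a quick client-side check can help separate "nothing is listening" from "something is silently dropping the traffic". Below is a minimal Python sketch; the function name `probe_tcp` and the result labels are my own, not part of any Check Point tooling. A "refused" result usually means a TCP reset came back (often from the server itself, though a device in the path can also send one), while "timeout" is more consistent with a silent drop somewhere. Neither is proof on its own, so treat it as a first hint before reaching for packet captures.

```python
import socket

def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout' or 'error: ...'."""
    try:
        # create_connection performs the full TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # A reset was received: typically nothing is listening on that port
        return "refused"
    except socket.timeout:
        # No answer at all: consistent with a silent drop along the path
        return "timeout"
    except OSError as exc:
        # Anything else (no route, name resolution failure, ...)
        return f"error: {exc}"
```

Usage would be something like `probe_tcp("server.example.com", 443)`, where the host name is only a placeholder.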

Lastly, a little humor. 😊

_Val_ · ‎2021-03-16

One of the battle stories I tell in community Live sessions is about just that. When migrating an MDS to new IP addresses, we stumbled on one of the CMAs failing to install policies on remote FWs. I immediately asked about third-party devices, and the customer said firmly, no, we do not have anything else. After 30 minutes of arguing and some traffic traces, I proved to them that they had something else blocking traffic. Two hours later, they identified it as a Juniper FW they had forgotten about eons ago...

CharlieFoxtrot · ‎2021-05-09

I never expected it, but I've heard that this is not an uncommon distinction between network and telco people, that telco techs are used to thinking in circuits/loops but network guys are just into the packet flow and forget to make sure it can come back? I guess you get used to it before too long either way, but I definitely saw it happen with a lot of new people.

spottex · ‎2021-03-16

Do other vendors' firewall interpretations of RFCs count?

(Albeit I sometimes wonder about CP as well - can't remember any examples right now but all VPN related)

e.g. SonicWall / Check Point IKEv1 IPsec VPN using PKI

The SonicWall firewall does not contain the full certificate chain, so you have to install the subordinate CAs into the Trusted section of CP (could be the cert issuer not doing something correctly).

By default, SonicWall expects the IKE ID to be the Distinguished Name (DN); CP sends the Main IP address.

SonicWall only accepts a cert from CP if the Main IP is the first Alternate Name added to the cert when generating the key. If this is not the case, one-way VPN initiation is still possible, but it fails if CP initiates the connection.

Victor_MR · ‎2021-03-18

OMG, I'm sure that was hard to diagnose.

I've experienced a similar issue with another vendor several years ago.

I'll try not to open the can of worms of "issues with other firewalls" too wide in the original list 😊

eliadr · ‎2021-03-22

In one of the places I worked, we had a major update for the AV on the workstations.
This happened within a week of upgrading the FW appliances.
So, since upgrading the FWs (or so we thought), surfing was so sloooooow.
A co-worker of mine chased this for half a year, until we somehow figured out the cause was the web reputation engine in the AV.
The engine couldn't get its signatures updated, so every URL took forever, until the engine gave up.

Cristobal_Johns · ‎2021-04-05

In a somewhat similar case, where file transfers were failing, the customer even escalated the ticket. Check Point had determined that the problem was on the ISP side. The customer did not believe it: they had already asked the ISP whether any changes had been made, and the ISP had sworn there were none.
Then, once tcpdump showed that the connection reset came from the ISP side, the ISP admitted they had implemented an improvement in their IPS software. They turned it off for our client and the problem went away.

HristoGrigorov · ‎2021-04-05

More firewall fun:

genisis__ · ‎2021-04-06

Put a smile on my face!

Vladimir · ‎2021-04-05

Without recounting most of what was written above, here is one from a few years ago:

The client stated that traffic going through the Check Point cluster to one of the multihomed AS/400s was being arbitrarily dropped.

As proof, they shared graphs from the Linux MTR (My traceroute) tool.

On the surface, it did look like CP was the culprit.

It ended up being a completely unrelated issue, but to prove that it was not the firewall, I had to create a mock environment. There I discovered that they were using MTR with its default settings, where probes start at a TTL of 1 and the TTL is decremented by 1 on each hop. Check Point's processing of the traffic at iIoO was decrementing the TTL to 0, resulting in false positives for the tool only, not for the actual traffic.
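To illustrate why low-TTL probes can tell a different story than real traffic, here is a deliberately simplified Python model of traceroute-style probing. It is not a reproduction of Check Point's internal behaviour: the hop names are hypothetical and every hop is assumed to decrement the TTL by exactly one. The point is only that a probe sent with a small TTL expires at some hop along the path, while normal traffic with a typical TTL of 64 passes the same hops untouched.

```python
from typing import List, Optional, Sequence

def expires_at(path: Sequence[str], ttl: int) -> Optional[str]:
    """Return the hop where a packet's TTL runs out, or None if it passes every hop."""
    for hop in path:
        ttl -= 1              # assumption: each routing hop decrements TTL by one
        if ttl <= 0:
            return hop        # this hop would answer with ICMP Time Exceeded
    return None               # packet survived all hops and reaches the destination

# Hypothetical three-hop path between the client and the server
path = ["router-a", "firewall", "router-b"]

# Traceroute-style probes: TTL 1, 2, 3, ... each expiring one hop further out
probe_results: List[Optional[str]] = [expires_at(path, ttl) for ttl in range(1, len(path) + 2)]

# Ordinary application traffic leaves the host with a much larger TTL
real_traffic = expires_at(path, 64)
```

Here the probes expire at `router-a`, `firewall` and `router-b` in turn, and only the fourth probe and the real traffic get through. So anything that treats expiring packets specially affects what the tool shows without touching production flows.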

genisis__ · ‎2021-04-06

Nice! I've seen something like this before as well, on a poorly written application.

Victor_MR · ‎2021-04-08

Good story!

Sometimes, it's quite difficult to reproduce an issue, test a device or create an identical lab without adding new conditions.

This reminds me of several examples where the customer was trying to test the throughput capacity of the gateways (not only the firewalls), and how difficult that is, especially for a layer-7 security device.

I'll add something around this to the original list 🙂 Thanks for sharing!

Zerat · ‎2021-07-06

In the memes section, you've missed the best one - I have it pinned to the board in my office:

If it's there, it must work. Hate to be beta-tester on GA

Timothy_Hall · ‎2021-07-07

A framed, full color copy of this Dilbert strip has been hanging in my ATC training room for years and is always good for a few laughs.

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com

Tony_Graham · ‎2021-07-06

Well, most times it seems to be something in-between that botches the connection but I have a good story....

Set up a demo computer in a conference room so a company could come in and do a 'dog and pony' show (yah, this was 'back in the day'). Computer was working perfectly. Hardwired and ready to go (pre-wireless networks). Don't recall the version of Check Point at the time, probably whatever came just after 2.1c. So the vendor shows up with a team of like 4 people (why, I have no idea). They are really trying to give a hard sell. They try to pull up their website and it doesn't work. 404 Not Found. I get called back into the conference room. "It doesn't work!" Hmm, a quick perusal of other websites says it's fine, it just doesn't work with their newfangled website. I tell them I have no idea. Well, that got them in a huff and they called me all manner of names, questioned my competence etc. I just said, "I cannot help you, there is no problem with the machine and nothing is blocking the connection." Fast forward about 40 minutes: one of their people and my manager decide to try it on my boss's PC... doesn't work there either. So the vendor gets on the line with their support... turns out they were trying to connect to a URL that their support team had dismantled. Oops. Yah, it wasn't the firewall!

Robert_OBrien · ‎2021-07-08

Like others have said, the biggest one we see is that the remote side isn't even listening on the port they claim is blocked.

_Val_ · ‎2021-07-08

@Robert_OBrien nah, it is another firewall lol

Robert_OBrien · ‎2021-07-09

@_Val_

Sure is....the Windows firewall on the box. 🙂

Robin_H · ‎2021-07-09

IPS side effects.

Ten years ago a consultant asked me to configure all Lync (later Skype) port objects without a protocol setting. I did so, disregarding the fact that we hadn't even enabled IPS/SmartDefense back then.

Recently a new external SBC and an internal analogue gateway (with a static NAT IP) were installed, and they wanted a fixed SIP TCP port and a few SIP UDP ports. This time I used the sip-tcp port object with protocol "SIP_TCP_PROTO", because the IPS activation project was under way and IPS was already in monitoring mode. IPS is important and should be used as much as possible, shouldn't it?

The calls didn't go through. Signaling happened but no sound.

During a three-hour session, walking through different configurations within the VoIP devices, the SBC admins finally noticed that the firewall had replaced the IP address in the SIP message.
The outgoing message from the external SBC to our analogue gateway showed the public IP in the CALL-ID.
The internal gateway behind the firewall received the message with the CALL-ID containing the private IP of the analogue gateway.

Using a different port object without a protocol setting solved the issue.
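If you capture the same SIP message on both sides of the firewall, a small script can point out which headers were rewritten in transit. The sketch below is a minimal Python example under my own assumptions: the function names are made up, the two messages are invented samples using documentation IP ranges, and repeated headers such as Via are not handled (the last one wins).

```python
from typing import Dict, Tuple

def sip_headers(message: str) -> Dict[str, str]:
    """Parse the header lines of a SIP message into a dict with lowercase names."""
    headers: Dict[str, str] = {}
    for line in message.splitlines()[1:]:       # skip the request/status line
        if not line.strip():
            break                               # a blank line ends the headers
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()   # repeated headers: last one wins
    return headers

def changed_headers(sent: str, received: str) -> Dict[str, Tuple[str, str]]:
    """Return headers present in both messages whose values differ, as (sent, received)."""
    a, b = sip_headers(sent), sip_headers(received)
    return {name: (a[name], b[name]) for name in a if name in b and a[name] != b[name]}

# Invented sample messages: what the SBC sent vs. what the internal gateway received
sent = "\r\n".join([
    "INVITE sip:100@example.com SIP/2.0",
    "Call-ID: abc123@203.0.113.10",
    "CSeq: 1 INVITE",
]) + "\r\n\r\n"
received = "\r\n".join([
    "INVITE sip:100@example.com SIP/2.0",
    "Call-ID: abc123@192.168.1.20",
    "CSeq: 1 INVITE",
]) + "\r\n\r\n"

changed = changed_headers(sent, received)
```

With these samples, `changed` contains only the `call-id` entry, which is the kind of difference the SBC admins spotted by eye.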

(I never actually followed up on this with Check Point. Let me know if you think I should.)

TJ_Aus · ‎2023-09-21

A physical HP ProLiant Windows server was imaged and migrated to a new HP ProLiant server as it required more "grunt".

The old server hardware was re-built with Windows and used for another purpose with a new name.

Both the old server and the new server used network teaming, and the image carried the old server's teamed MAC address over to the new physical server. When both servers were active (in the same subnet), one would drop out and could not be contacted. But it just had to be the firewall's fault, didn't it?
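A duplicate MAC like this can often be spotted from a neighbouring device's ARP table. Here is a minimal Python sketch that groups (IP, MAC) pairs by MAC address and reports any MAC claimed by more than one IP. The function name and the sample entries are hypothetical, and how you export the ARP table is up to you. Keep in mind that one MAC with several IPs can also be perfectly legitimate (a host with secondary addresses, for instance), so this only flags candidates to look at.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def duplicate_macs(entries: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (ip, mac) pairs by MAC and return the MACs seen with more than one IP."""
    by_mac = defaultdict(set)
    for ip, mac in entries:
        # Normalise notation so 00-1A-... and 00:1a:... compare as equal
        by_mac[mac.lower().replace("-", ":")].add(ip)
    return {mac: sorted(ips) for mac, ips in by_mac.items() if len(ips) > 1}

# Hypothetical ARP table export as (ip, mac) pairs
arp_entries = [
    ("10.0.0.11", "00:1A:2B:3C:4D:5E"),   # rebuilt old server
    ("10.0.0.12", "00-1a-2b-3c-4d-5e"),   # new server carrying the imaged MAC
    ("10.0.0.13", "00:1a:2b:3c:4d:5f"),
]

suspects = duplicate_macs(arp_entries)
```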

Timothy_Hall · ‎2023-10-17

So had another "it's not the firewall" moment I thought I'd share.

Last week I was running a CCSA class and things went pretty badly wrong on one of the lab workstations an attendee was using. Just a "-" status showing up in SmartConsole for everything, random policy installation failures, cpd crashing constantly etc. Also was seeing "BUG: soft lockup for 22s" on several of the Gaia systems. All other attendees on their lab workstations were fine, and all the lab workstations are cut from the same base image so I was starting to wonder what the heck this attendee had done.

Dug into the bad workstation after the class was over and it seemed the configuration was simply flat-out corrupt on all the Check Point R81.20 GA virtual machines, along with soft lockups occurring randomly. Started to suspect a possible workstation hardware issue so I downloaded all the latest diagnostics from Lenovo and ran every possible extended diagnostic including a full media test of all hard drives/SSDs. After a few hours all tests pass with flying colors. Hmm...

At this point I'm desperate and boot up a USB copy of PassMark Memtest86 which I would normally only do when upgrading or changing memory sticks. This is a great program that hammers the heck out of RAM for about 12 hours trying to force bit errors in borderline memory sticks. The memory sticks in this workstation had been tested several years ago when I first installed them and passed 100%.

So I fired it up and not 15 minutes into running the test this appears:

What the...??? Never had a memory stick partially fail like that after working fine for several years. So after swapping out the two pairs of dual memory sticks, moving them around between different slots and retesting, I conclusively identify the pair that is bad. It is most likely only one of two sticks in the pair that is bad, but I've spent so much time screwing around with this, here was my final solution accompanied by some extremely foul language:

Man that felt good. Rest assured that the fragments will be e-recycled properly. My fear was that the e-recycler would just end up putting them up on eBay and some other unfortunate soul would end up with them. Not gonna happen now...


Garrett_DirSec · ‎2023-10-17

Hey @Timothy_Hall -- sorry for the frustrations and extended waste of cycles. I'm glad you were able to definitively identify the problem hardware.

I've run into issues with solder "whiskers" and "bad" solder joints that appeared to fail based on temperature. With all the solder contacts on a single memory stick -- let alone a motherboard -- it's truly an incredible thing we don't deal with this type of thing more frequently. The fab process for chip manufacturers must be a fascinating thing.

Tony_Graham · ‎2023-10-17

For sure. Modern electronics stability is quite impressive.

Go engineers!

Tony_Graham · ‎2023-10-17

Interesting: one chip is from China, the other from Mexico.

I notice also that they are a mismatched pair. One is 2133 speed while the other is 2400. That can be a recipe for disaster. I would have tried one stick at a time and also confirmed which speed the board is supposed to have. It's possible neither stick was actually bad. Having said that, I find laptop memory to be the least reliable.

Garrett_DirSec · ‎2023-10-17

Excellent catch, @Tony_Graham. Agreed, that's a mismatch and a recipe for disaster.

However, I would ask why the mobo would not simply run both at the slowest speed (i.e. you would think the 2400 module could run at the 2133 speed)?

the_rock · ‎2023-10-17

Oh yeah, laptop memory never let me down. Firewall one, that's a different story 😂😂

Bob_Zimmerman · ‎2023-10-17

With anything remotely modern (and anything which can run DDR4 is definitely modern enough), speed mismatches aren't a problem. The controller will just train both channels at the lower speed.

I don't think I knew Micron did any assembly in Mexico. I wonder where the chips came from. They don't have any fabs anywhere near Mexico last I checked. Fab 6 in Virginia is probably the closest, but they do their own assembly. They have a bunch of fabs in Taiwan, but again, they do assembly locally.

Tony_Graham · ‎2023-10-17

Probably somewhat depends on the quality of components used. I recently had an HP laptop exhibit the same issues. I tested each module individually since the laptop would not even boot into Windows without bluescreen. It ran fine on one stick but with the other it would not boot. I also tested on different DIMM slots just to be sure the slot itself was not at fault.

Timothy_Hall · ‎2023-10-17

Agreed Bob, speed mismatches really shouldn't matter given that combo of memory worked just fine for about 4 years or more. The motherboard they were plugged into could only go 2133, so I suspect the 2133 module was older and based on my records probably purchased in 2017. The 2400 module appears to have been purchased in 2020 or so, which is a frequent occurrence as the older 2133-only modules became scarce and expensive when faster modules became available. My guess is that the older China 2133 module which ran at maximum speed its whole life is the one that failed; didn't notice the country of assembly difference but if I had, I probably would have had a grudge match between the two modules I snapped to see which one (and country) was actually the bad one. Didn't really care at that point, guilt by association in my book...


Vladimir · ‎2023-10-17

When it comes to the labs on laptops, I've had a great experience with Dell Precision 5510s equipped with Xeons and ECC RAM. Never had any stability issues, until my wife dropped it from the desk:). Same model with Core i7 and no ECC RAM is a lot more fickle.

Have you considered running labs on nested ESXi hosts running on Dell R650 servers loaded with RAM to capacity?

the_rock · ‎2023-10-17

For me, EVE-NG is the law for the labs, so easy and convenient.
