Victor_MR
Employee Employee
Employee

"It's not the firewall"

Hi (check)mates,

We all know that "the firewall" is one of the first things people blame when there is a traffic issue. A security gateway (a "firewall") does a lot of "intelligent stuff" beyond just routing traffic (and, in fact, many network devices today also do "a lot of things"), so I understand there is good reason to suspect the firewall. At the same time, there are many cases where the issue has nothing to do with it, or is only indirectly related.

I'm looking to build a brief list of typical or fairly frequent issues we face where "the firewall" is reported as the root cause, but in the end it isn't.

It's quite a generic topic, and in terms of troubleshooting it's probably even more generic. There are several simple tools that one should probably use first, such as traffic logs, fw monitor, tcpdump/cppcap, etcetera. But what I would like to focus on is not the troubleshooting but the issues themselves. This assumes, of course, that the firewall side is properly configured (otherwise it would be a "firewall issue", albeit one caused by a bad configuration).

To narrow the scope, I'm focusing especially on networking issues, but every idea is welcome.

Do you think it would be useful to put together such a list? 🙂 What issues do you usually find?

Something to start

(I'll update this list with new suggested issues):

  • A multicast issue with the switches, impacting the cluster behavior.
  • A VLAN is not propagated to all the switches involved in the cluster communication, especially in VSX environments, where not all the VLANs are monitored by default.
  • Related to remote access VPN (this year has been quite active in that respect): some device on the WAN side blocks the ISAKMP UDP 4500 packets directed to our gateways, but not all UDP 4500 traffic. Typically, another firewall 🤣
  • Asymmetric routing issues, where the traffic goes through one member and comes back through the other member of the cluster.
  • Static ARP entries in the "neighbor" routing devices, or ARP cache issues.
  • Any kind of issue with Internet access: DNS queries not allowed to Internet or to the corporate DNS servers (so we cannot solve our public domains), or TCP ports blocked, or any required URL blocked (typically by a proxy)...
  • Traffic delays: these are typically more difficult to diagnose. fw monitor with timestamps is one of our friends here.
  • Layer 1 (physical) issues. Don't forget to review the hardware interface counters!
  • Missing route at the destination, especially routes for the encryption domains in a VPN.
  • Why not: another firewall blocking the communication, of course 😊 Or a forgotten transparent layer-7 device in the middle (like an IPS), installed in a previous age. This may be a variant: "it's not my firewall"
  • An application or server issue. The simplest example is that the server is not listening on the requested port. A more complex one would be an application layer issue.
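
For the "server is not listening" case in the list above, a quick client-side probe can help separate an active refusal from silence. This is only a minimal sketch in Python using the standard `socket` module; the function name `probe_tcp` and the three result labels are my own choices for illustration, not anything from a Check Point tool.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP connection attempt as 'open', 'refused' or 'no-response'."""
    try:
        # A completed three-way handshake means something is listening.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # A RST came back: the host is usually reachable but nothing listens
        # on the port (or some device in the path actively rejects).
        return "refused"
    except socket.timeout:
        # No answer at all: packets silently dropped, or host unreachable.
        return "no-response"
    except OSError as exc:
        # Anything else (no route, name resolution failure, ...).
        return f"error: {exc}"
```

A "refused" result usually points at the server rather than at a device silently dropping traffic, although a firewall configured to reject can produce the same answer, so it is a hint and not a proof. Traffic logs and fw monitor on the gateway remain the way to confirm.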

Lastly, a little humor. 😊

the-moment-when-you-prove-its-not-the-firewall.jpg

5c9e05a465c753417dbde949ee285fd9e56f0739ad0254de838a7d8c61c1a318.jpg

169j56.jpg

1sg25m.jpg

 

61 Replies
_Val_
Admin
Admin

One of the battle stories I tell in community Live sessions is about just that. When migrating an MDS to new IP addresses, we stumbled on one of the CMAs failing to install policies on remote FWs. I immediately asked about third-party devices, and the customer firmly said no, we do not have anything else. After 30 minutes of arguing and some traffic traces, I proved to them that they had something else blocking the traffic. Two hours later, they identified it as a Juniper FW they had forgotten about eons ago...

CharlieFoxtrot
Explorer

I never expected it, but I've heard that this is a fairly common distinction between network and telco people: telco techs are used to thinking in circuits/loops, while network guys are just into the packet flow and forget to make sure it can come back. I guess you get used to it before too long either way, but I definitely saw it happen with a lot of new people.

0 Kudos
spottex
Collaborator

Do other vendors' firewall interpretations of RFCs count?

(Albeit I sometimes wonder about CP as well. I can't remember any examples right now, but they were all VPN related.)

E.g. SonicWall / Check Point IKEv1 IPsec VPN using PKI:

The SonicWall firewall does not contain the full certificate chain, so you have to install the subordinate CAs into the Trusted section of CP. (It could be the cert issuer not doing something correctly.)

By default, SonicWall expects the IKE ID to be the Distinguished Name (DN); CP sends the Main IP address.

SonicWall only accepts a cert from CP if the Main IP is the first alternate name added to the cert when generating the key. If this is not the case, one-way VPN initiation is still possible, but it fails if CP initiates the connection.

0 Kudos
Victor_MR
Employee Employee
Employee

OMG, I'm sure that was hard to diagnose.

I've experienced a similar issue with another vendor several years ago.

I'll try not to open the can of worms of "issues with other firewalls" too wide in the original list 😊

0 Kudos
eliadr
Participant

In one of the places I worked, we had a major update for the AV on the workstations.
This happened a week apart from upgrading the FW appliances.
So, since upgrading the FWs (or so we thought), surfing was so sloooooow.
A co-worker of mine chased this for half a year, until we somehow figured out that the cause was the web reputation engine in the AV.
The engine couldn't get its signatures updated, so every URL took forever, until the engine gave up.

0 Kudos
Cristobal_Johns
Employee
Employee

In a somewhat similar problem, where files were not transferring, the customer even escalated the ticket. Check Point had determined that the problem was on the ISP side. The customer did not believe it: they had already asked the ISP whether any changes had been made, and the ISP had sworn there were none.
Then, once tcpdump showed that the connection was being reset on the ISP side, the ISP admitted that they had implemented an improvement in their IPS software. They turned it off for our client and the problem went away.

0 Kudos
HristoGrigorov

More firewall fun:

fw-fun-1.PNG

 

fw-fun-2.PNG

 

genisis__
Leader Leader
Leader

Put a smile on my face! 

Vladimir
Champion
Champion

Without recounting most of what was written above, here is one from a few years ago:

The client stated that traffic going through a Check Point cluster to one of the multihomed AS/400s was being arbitrarily dropped.

As proof, they shared the graphs from the Linux MTR (my traceroute) tool.

On the surface of things, it did look like CP was the culprit.

It ended up being a completely unrelated issue, but to prove that it was not the firewall, I had to create a mock environment, where I discovered that they were using the MTR default TTL of 1 per hop, which was decremented by 1 on each hop. Check Point's processing of the traffic at the iIoO inspection points was decrementing the TTL to 0, resulting in false positives for the tool only, not for the actual traffic.

CP_MTR_Screenshot0.jpg

CP_MTR_Screenshot4.jpg

 

CP_MTR_Screenshot1.jpg

CP_MTR_Screenshot2.jpg

CP_MTR_Screenshot3.jpg

 

genisis__
Leader Leader
Leader

Nice! I've seen something like this before as well, with a poorly written application.

0 Kudos
Victor_MR
Employee Employee
Employee

Good story!

Sometimes, it's quite difficult to reproduce an issue, test a device or create an identical lab without adding new conditions.

This reminds me of several cases where customers were trying to test the throughput capacity of the gateways (not only the firewalls), which is very difficult to do, especially with a layer-7 security device.

I'll add something around this to the original list 🙂 Thanks for sharing!

Zerat
Participant

In the memes section, you've missed the best one - I have it pinned to the board in my office:

dilbert_blame_the_firewall.jpg

#######################################
If it's there, it must work. Hate to be beta-tester on GA
Timothy_Hall
Legend Legend
Legend

A framed, full color copy of this Dilbert strip has been hanging in my ATC training room for years and is always good for a few laughs.

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
0 Kudos
Tony_Graham
Advisor

Well, most times it seems to be something in-between that botches the connection but I have a good story....

Set up a demo computer in a conference room so a company could come in and do a 'dog and pony' show (yah, this was 'back in the day'). The computer was working perfectly. Hardwired and ready to go (pre-wireless networks). I don't recall the version of Check Point at the time, probably whatever came just after 2.1c. So the vendor shows up with a team of like 4 people (why, I have no idea). They are really trying to give a hard sell. They try to pull up their website and it doesn't work. 404 Not Found. I get called back into the conference room. "It doesn't work!" Hmm, a quick perusal of other websites says the machine is fine, it just doesn't work with their newfangled website. I tell them I have no idea. Well, that got them in a huff and they called me all manner of names, questioned my competence, etc. I just said, "I cannot help you, there is no problem with the machine and nothing is blocking the connection." Fast forward about 40 minutes: one of their people and my manager decide to try it on my boss's PC... it doesn't work there either. So the vendor gets on the line with her support... turns out they were trying to connect to a URL that their support team had dismantled. Oops. Yah, it wasn't the firewall!

Robert_OBrien
Participant

Like others have said, the biggest one we see is that the remote side isn't even listening on the port they claim is blocked.

0 Kudos
_Val_
Admin
Admin

@Robert_OBrien nah, it is another firewall lol

Robert_OBrien
Participant

@_Val_ 

Sure is....the Windows firewall on the box.  🙂

 

Robin_H
Contributor

IPS side effects.

Ten years ago a consultant asked me to configure all Lync (later Skype) relevant port objects without a protocol setting. Said and done, disregarding the fact that we hadn't even enabled IPS/SmartDefense back then.

Recently a new external SBC and an internal analogue gateway (with a static NAT IP) were installed, and they wanted a fixed SIP TCP port and a few SIP UDP ports. This time I used the sip-tcp port object with protocol "SIP_TCP_PROTO", because the IPS activation project was coming along and IPS was already in monitoring mode. IPS is important and needs to be used as much as possible, doesn't it?

The calls didn't go through. Signaling happened but no sound.

During a three-hour session, walking through different configurations within the VoIP devices, the SBC admins finally noticed that the firewall had replaced the IP address in the SIP message.
The outgoing message from the external SBC to our analogue gateway showed the public IP in the CALL-ID.
The internal gateway behind the firewall received the message with the CALL-ID containing the private IP of the analogue gateway.

Using a different port object without a protocol setting solved the issue.
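
A rewrite like the one described above can be spotted faster by comparing the same SIP message as captured on each side of the firewall. Below is a minimal Python sketch under my own assumptions: the helper names `parse_headers` and `changed_headers` are hypothetical, the two messages are made-up samples (not the ones from this case), and repeated headers such as Via are not handled (the last value wins).

```python
def parse_headers(message):
    """Return the SIP headers of a raw message as {lowercase name: value}."""
    headers = {}
    # Skip the request/status line, stop at the blank line before the body.
    for line in message.splitlines()[1:]:
        if not line.strip():
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def changed_headers(sent, received):
    """Return {header: (sent value, received value)} for headers that differ."""
    before, after = parse_headers(sent), parse_headers(received)
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


# Hypothetical capture: same INVITE before and after the firewall.
sent = (
    "INVITE sip:100@example.invalid SIP/2.0\r\n"
    "Call-ID: abc123@203.0.113.10\r\n"
    "CSeq: 1 INVITE\r\n\r\n"
)
received = sent.replace("203.0.113.10", "192.168.1.20")
print(changed_headers(sent, received))
```

Any header that shows up in the result was modified in transit, which is a hint that some device in the path is doing SIP-aware inspection or NAT on the payload and not only on the IP header.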

( I never actually followed up on this with Checkpoint. Let me know if you think I should )

0 Kudos
TJ_Aus
Contributor

A physical HP ProLiant Windows server was imaged and migrated to a new HP ProLiant server as it required more "grunt".

The old server hardware was re-built with Windows and used for another purpose with a new name.

Both the imaged server and the new server used network teaming, and the imaged server kept the MAC address, which was carried over from the old server to the new physical server. When both servers were active (in the same subnet), one would drop out and could not be contacted. But it just had to be the firewall's fault, didn't it?
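
A duplicate MAC like this tends to show up in the ARP table of a neighboring router or gateway as two IPs resolving to the same hardware address. As a rough sketch (Python, with a hypothetical helper `find_duplicate_macs` and made-up sample entries), one could scan exported (IP, MAC) pairs like this:

```python
from collections import defaultdict


def find_duplicate_macs(arp_entries):
    """Return {mac: sorted list of IPs} for MACs claimed by more than one IP."""
    ips_by_mac = defaultdict(set)
    for ip, mac in arp_entries:
        # Normalize case and separator so 00-11-22 and 00:11:22 compare equal.
        ips_by_mac[mac.lower().replace("-", ":")].add(ip)
    return {mac: sorted(ips) for mac, ips in ips_by_mac.items() if len(ips) > 1}


# Hypothetical ARP table entries, for illustration only.
entries = [
    ("10.0.0.5", "00:11:22:AA:BB:CC"),
    ("10.0.0.9", "00-11-22-aa-bb-cc"),
    ("10.0.0.7", "00:11:22:aa:bb:dd"),
]
print(find_duplicate_macs(entries))
```

A hit is not automatically a fault (a host with several IPs or proxy ARP looks the same), so treat the output as a list of candidates to check.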

Timothy_Hall
Legend Legend
Legend

So had another "it's not the firewall" moment I thought I'd share.  

Last week I was running a CCSA class and things went pretty badly wrong on one of the lab workstations an attendee was using.  Just a "-" status showing up in SmartConsole for everything, random policy installation failures, cpd crashing constantly etc.  Also was seeing "BUG: soft lockup for 22s" on several of the Gaia systems.  All other attendees on their lab workstations were fine, and all the lab workstations are cut from the same base image so I was starting to wonder what the heck this attendee had done.

Dug into the bad workstation after the class was over and it seemed the configuration was simply flat-out corrupt on all the Check Point R81.20 GA virtual machines, along with soft lockups occurring randomly.  Started to suspect a possible workstation hardware issue so I downloaded all the latest diagnostics from Lenovo and ran every possible extended diagnostic including a full media test of all hard drives/SSDs.  After a few hours all tests pass with flying colors.  Hmm...

At this point I'm desperate and boot up a USB copy of PassMark Memtest86 which I would normally only do when upgrading or changing memory sticks.  This is a great program that hammers the heck out of RAM for about 12 hours trying to force bit errors in borderline memory sticks.  The memory sticks in this workstation had been tested several years ago when I first installed them and passed 100%.

So I fired it up and not 15 minutes into running the test this appears:

fail.jpg

 

What the...??? Never had a memory stick partially fail like that after working fine for several years.  So after swapping out the two pairs of dual memory sticks, moving them around between different slots and retesting, I conclusively identify the pair that is bad.  It is most likely only one of two sticks in the pair that is bad, but I've spent so much time screwing around with this, here was my final solution accompanied by some extremely foul language:

snapped.jpg

Man that felt good.  Rest assured that the fragments will be e-recycled properly.  My fear was that the e-recycler would just end up putting them up on eBay and some other unfortunate soul would end up with them.  Not gonna happen now...

Garrett_DirSec
Advisor

Hey @Timothy_Hall -- sorry for the frustrations and extended waste of cycles.   I'm glad you were able to definitively identify the problem hardware.   

I've run into issues with solder "whiskers" and "bad" solder joints that appeared to fail based on temperature.     With all the solder contacts on a single memory stick -- let alone a motherboard -- it's truly an incredible thing we don't deal with this type of thing more frequently.   The fab process for chip manufacturers must be a fascinating thing.   

 

0 Kudos
Tony_Graham
Advisor

For sure. Modern electronics stability is quite impressive.

Go engineers!

0 Kudos
Tony_Graham
Advisor

Interesting: one chip from China, the other from Mexico.

I notice also that they are a mismatched pair. One is 2133 speed while the other is 2400.

That can be a recipe for disaster. I would have tried one stick at a time and also confirmed which speed the board is supposed to have. It's possible neither stick was actually bad. Having said that, I find laptop memory to be the least reliable.

0 Kudos
Garrett_DirSec
Advisor

Excellent catch @Tony_Graham. Agreed, that's a mismatch and a recipe for disaster.

However, I would ask why the mobo would not simply run both at the slowest speed (i.e. you would think the 2400 module could run at the 2133 speed)?

0 Kudos
the_rock
Legend
Legend

Oh yeah, laptop memory never let me down. Firewall one, that's a different story 😂😂

0 Kudos
Bob_Zimmerman
Authority
Authority

With anything remotely modern (and anything which can run DDR4 is definitely modern enough), speed mismatches aren't a problem. The controller will just train both channels at the lower speed.

I don't think I knew Micron did any assembly in Mexico. I wonder where the chips came from. They don't have any fabs anywhere near Mexico last I checked. Fab 6 in Virginia is probably the closest, but they do their own assembly. They have a bunch of fabs in Taiwan, but again, they do assembly locally.

0 Kudos
Tony_Graham
Advisor

Probably somewhat depends on the quality of components used. I recently had an HP laptop exhibit the same issues. I tested each module individually since the laptop would not even boot into Windows without bluescreen. It ran fine on one stick but with the other it would not boot. I also tested on different DIMM slots just to be sure the slot itself was not at fault.

0 Kudos
Timothy_Hall
Legend Legend
Legend

Agreed Bob, speed mismatches really shouldn't matter given that combo of memory worked just fine for about 4 years or more.  The motherboard they were plugged into could only go 2133, so I suspect the 2133 module was older and based on my records probably purchased in 2017.  The 2400 module appears to have been purchased in 2020 or so, which is a frequent occurrence as the older 2133-only modules became scarce and expensive when faster modules became available.  My guess is that the older China 2133 module which ran at maximum speed its whole life is the one that failed; didn't notice the country of assembly difference but if I had, I probably would have had a grudge match between the two modules I snapped to see which one (and country) was actually the bad one.  Didn't really care at that point, guilt by association in my book...

0 Kudos
Vladimir
Champion
Champion

When it comes to the labs on laptops, I've had a great experience with Dell Precision 5510s equipped with Xeons and ECC RAM. Never had any stability issues, until my wife dropped it from the desk:). Same model with Core i7 and no ECC RAM is a lot more fickle.

Have you considered running labs on nested ESXi hosts running on Dell R650 servers loaded with RAM to capacity?

 

the_rock
Legend
Legend

For me, EVE-NG is the law for the labs, so easy and convenient.

0 Kudos
