HTTPS Categorization ... a drama

Matt_Taber · ‎2018-06-21

Thought I would share a situation we dealt with yesterday and into the wee early morning hours of today.

We started receiving reports that users were having intermittent issues with multiple google.com sites (drive, cse, apps, etc.). While troubleshooting with curl and openssl on the CLI, we discovered that the APP/URL blade was erroneously dropping connections to some Google IP addresses (fw ctl zdebug drop showed "dropped by: fwpslglue_chain Reason: PSL Drop: ASPII_MT"), while correctly passing the traffic to others.

For those that don't know, with HTTPS Categorization (not Inspection) the determining factor the gateways use to permit or drop is the CN in the certificate returned by the server. It does not (supposedly) rely on IP addresses, and it can't read the FQDN in the HTTP request because that traffic is encrypted.
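To make "the CN in the certificate" concrete, here is a rough Python sketch of pulling the subject CN and the DNS subjectAltName entries out of a certificate. It is not Check Point code and says nothing about how the gateway parses certificates. The function name names_from_cert and the sample certificate are made up for illustration; the dict layout is the one Python's ssl.SSLSocket.getpeercert() returns.

```python
def names_from_cert(cert):
    """Return (common_name, dns_sans) from a getpeercert()-style dict."""
    common_name = None
    # "subject" is a tuple of RDNs, each RDN a tuple of (key, value) pairs.
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                common_name = value
    # Keep only the DNS-type subjectAltName entries.
    sans = [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]
    return common_name, sans


# Hand-written sample (hypothetical names) in the getpeercert() layout.
sample_cert = {
    "subject": (
        (("organizationName", "Example Org"),),
        (("commonName", "*.example.com"),),
    ),
    "subjectAltName": (("DNS", "*.example.com"), ("DNS", "*.example.net")),
}

cn, sans = names_from_cert(sample_cert)
print(cn, sans)
```

A wildcard certificate like this one carries a single CN but can cover many hostnames, which is why the name in the certificate is only an approximation of what the user actually asked for.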

After hours (and multiple shift changes) working with support, there was still no solution in sight. We rolled back APP/URL policies to a known good date, to no avail. We failed over, we rebooted, and failed back over. We cleared out the APP/URL local cache on the gateways and set them to clear on policy installation. We ran debugs on APP/URL, rad, etc.

We performed packet captures on both the working and non-working traffic. We pulled the cert (*.google.com) out of each capture and verified that the certs were identical in the working and non-working captures. We, as well as CP support, were surprised by this. If the CN is the deciding factor, why would an identical cert behave differently based on which IP address the FQDN resolved to? We thought for sure the cert would be different.
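For anyone wanting to repeat the "are the certs really identical?" check without digging through captures, a minimal Python sketch follows. This is not what we ran that night (we used packet captures); the helper names are made up, and the IPs in the usage note are placeholders. It connects to a specific IP, sends the hostname as SNI, and compares SHA-256 fingerprints of the leaf certificates.

```python
import hashlib
import socket
import ssl


def fingerprint(der_bytes):
    """SHA-256 fingerprint (hex) of a DER-encoded certificate."""
    return hashlib.sha256(der_bytes).hexdigest()


def fetch_cert_der(ip, hostname, port=443, timeout=5.0):
    """Connect to one specific IP, send `hostname` as SNI, return the leaf cert as DER."""
    ctx = ssl.create_default_context()
    with socket.create_connection((ip, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=hostname) as tls:
            return tls.getpeercert(binary_form=True)


def same_cert(der_a, der_b):
    """True when both DER blobs have the same fingerprint."""
    return fingerprint(der_a) == fingerprint(der_b)
```

Usage would look like same_cert(fetch_cert_der("<working IP>", "drive.google.com"), fetch_cert_der("<failing IP>", "drive.google.com")). If the gateway drops the connection to the failing IP, the handshake may not complete from behind it, so this probably has to be run from a path that does not cross the gateway.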

At this point we debugged wstlsd while still running curl/openssl tests. To our surprise, none of our test traffic was showing up in the debug. Finally, this tipped off our support rep, and the magic command was issued:

fw tab -t cptls_server_cn_cache -x -y

"The cache saves mapping between IP+Port to CN (Certificate's Canonical Name) and a flag if the CN is valid. The table will go up to 10,000 entries and be cleared automatically to make room for new entries."

There's not much in the support portal regarding this cache. Only pertinent match was: sk120775
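The cache description quoted above can be pictured with a toy Python model. To be clear, this is a guess at the general shape, not Check Point's implementation. The class and function names are invented, the eviction behaviour (clear everything when full) is an assumption, and so is treating the "valid" flag as the thing that drives accept or drop. The IPs are documentation addresses.

```python
class ServerCnCache:
    """Toy (IP, port) -> (CN, valid flag) cache. Illustration only."""

    def __init__(self, max_entries=10_000):
        self.max_entries = max_entries
        self.entries = {}

    def lookup(self, ip, port):
        return self.entries.get((ip, port))

    def store(self, ip, port, cn, valid):
        # Assumption: when the table is full, clear it to make room.
        if (ip, port) not in self.entries and len(self.entries) >= self.max_entries:
            self.entries.clear()
        self.entries[(ip, port)] = (cn, valid)

    def flush(self):
        self.entries.clear()


def verdict(cache, ip, port, cn_from_cert, is_allowed):
    """Trust a cached entry if one exists; otherwise classify the CN and cache it."""
    cached = cache.lookup(ip, port)
    if cached is None:
        cache.store(ip, port, cn_from_cert, is_allowed(cn_from_cert))
        cached = cache.lookup(ip, port)
    _cn, valid = cached
    return "accept" if valid else "drop"


def allow_google(cn):
    return cn.endswith("google.com")


cache = ServerCnCache()
cache.store("192.0.2.10", 443, "*.google.com", False)  # simulated bad entry

print(verdict(cache, "192.0.2.10", 443, "*.google.com", allow_google))  # drop
print(verdict(cache, "192.0.2.20", 443, "*.google.com", allow_google))  # accept
cache.flush()
print(verdict(cache, "192.0.2.10", 443, "*.google.com", allow_google))  # accept
```

In this model the same certificate gets a different verdict depending only on which IP it came from, and the bad verdict survives until the entry is flushed. That matches the symptoms we saw, though it does not prove this is what happened inside the gateway.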

Eureka! After running this command we were fully functional on all the Google IPs we were testing. There had to be either:

1) A miscategorized association between IP+Port & CN

2) A corrupted entry for the above

After we resolved this, it was apparent why a failover/reboot/failover did not resolve anything: the connections and caches all stayed synced and active with the invalid information.

After so many hours (11, I think?) chasing this down, we did not pursue an RCA on why the cache entries were victimizing us. We were just glad we found them and could go to bed.

Hopefully someone will run across this article and it saves their bacon (and 11 hours).

TL;DR: HTTPS Categorization doesn't use IP addresses directly for categorization purposes, but it sure does cache them.

PhoneBoy · ‎2018-06-21

Thanks for sharing your experience here.

It would make sense that we cache the name/IP lookup.

We do something similar for HTTPS Inspection and bypass rules.

Vladimir · ‎2018-06-22

It would be useful to have a diagram of HTTP/S connection processing that depicts everything that may affect it, caches included. It probably wouldn't hurt to have a WebUI option for viewing as well as clearing them, with the usual disclaimers.

Nothing against CLI when it is called for, but when TAC is spending hours or days hunting issues down, a little simplification may be a better option.

Albert_Wilkes · ‎2018-07-02

Thanks for sharing the experience and the diagnostics. I'd like to offer another possible explanation: the IP address might have been associated with a different hostname before. Google might have just repurposed it?

Indeed, categorization is naturally fuzzy when it comes to blocking traffic, particularly with sites where you have multiple CNs in one certificate. Categorization can't know which of the sites (CNs) the client will eventually request once the session has been established. I recommend trialling the CP as a proxy, or having a proxy in a DMZ, as discussed in this article:

HTTPS inspection real life examples and caveats in R77.30 and R80.10

The use of a proxy would allow the firewall to learn the hostname "directly" from the CONNECT request, rather than using a fuzzy/ambiguous surrogate from the response, which might well contain multiple CNs, as I demonstrate in the article.
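As a small illustration of why the CONNECT request is less ambiguous, here is a Python sketch that reads the target host and port straight off a CONNECT request line. The function name host_from_connect is made up, and this is only a sketch of the idea, not how the gateway's proxy code works.

```python
def host_from_connect(request_line):
    """Return (host, port) from an HTTP CONNECT request line, or None if it is not one."""
    parts = request_line.strip().split()
    if len(parts) != 3 or parts[0] != "CONNECT":
        return None
    # Split on the last colon so bracketed IPv6 literals keep their inner colons.
    host, sep, port = parts[1].rpartition(":")
    if not sep or not host or not port.isdigit():
        return None
    return host.strip("[]"), int(port)


print(host_from_connect("CONNECT drive.google.com:443 HTTP/1.1"))
```

The client names exactly one host here, whereas a certificate in the response may list many names and leaves the firewall guessing which one was meant.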

Paul_Hagyard · ‎2019-09-12

Running as an explicit proxy works but:

It runs (or did, unsure if this is still the case) in user space, so performance is not as good as transparent mode, where it runs in kernel space (although performance doesn't matter if it doesn't work 🙂 ).
The default (read: reliable) configuration for the engine settings is to fail open (allow traffic) when something goes wrong internally. I've seen instances where undesirable traffic was allowed because of a so-called engine failure.
The default (read: reliable) configuration for the engine settings is to allow traffic until categorisation completes. I broke my home network by setting this to hold. It got into a race condition during a restart of the network: DNS through the firewall didn't work because the firewall couldn't reach the categorisation service, because DNS wasn't allowed...

Load-balanced Squid is one option, potentially using ICAP to have the retrieved files passed through the Check Point gateway's Threat Prevention Anti-Virus and Threat Emulation.

Daniel_Taney · ‎2018-10-10

Thank you for this writeup! I think this could pertain to an issue I'm seeing on one of my GWs. Can the fw tab -t cptls_server_cn_cache -x -y command be run on the GW without an interruption to traffic? Or should something like this be done off-hours?

TIA!

-Dan

R80 CCSA / CCSE

Matt_Taber · ‎2018-10-11

My understanding is that this command flushes the cached associations between IP addresses and categorization results. I have run it during production hours without incident, since the affected traffic is mainly web browsing.

Jonathan_Griffi · ‎2019-05-28

Hi,

I have just experienced the same issue on an R77.30 gateway. Clearing down that table/cache worked a treat. Thank you for this information, you have just saved me a lot of tshooting.

Cheers,

Jon

Kurpeus · ‎2020-02-27

Gents

Does this table still exist? I get an error message when trying to query it:

[Expert@FW01:0]# fw tab -t cptls_server_cn_cache -s
HOST NAME ID #VALS #PEAK #SLINKS
Failed to get table status for cptls_server_cn_cache

Or was this replaced by cptls_host_name_cache?

Timothy_Hall · ‎2020-02-27

If you are on R80.30, HTTPS/TLS Inspection was reworked to some degree in that version (including a new TLS parser), which may account for that table no longer existing or being renamed. It could also be related to the primary use of SNI, instead of the certificate's server name, for categorization purposes.

