Thought I would share a situation we dealt with yesterday and into the early morning hours of today.
We started receiving reports that users were having intermittent issues with multiple google.com sites (drive, cse, apps, etc.). While troubleshooting with curl and openssl on the CLI, we discovered that the APP/URL blade was erroneously dropping connections to some Google IP addresses (fw ctl zdebug drop showed "dropped by: fwpslglue_chain Reason: PSL Drop: ASPII_MT"), while correctly passing traffic to others.
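For anyone who wants to reproduce this kind of per-IP test without hand-running curl/openssl against each address, here is a minimal Python sketch. It is not what we ran at the time; the function names (resolve_ipv4, probe, summarize, run) are made up for illustration, and it assumes the client sits behind the gateway and has outbound 443 access. It connects to every IPv4 address a hostname resolves to, sends the same SNI each time, and records either the SHA-256 fingerprint of the returned certificate or the error.

```python
import hashlib
import socket
import ssl
from collections import defaultdict


def resolve_ipv4(hostname, port=443):
    """Return the sorted, de-duplicated IPv4 addresses for hostname."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def probe(ip, hostname, port=443, timeout=5.0):
    """Connect to one IP with `hostname` as SNI; return (ok, detail).

    detail is the certificate's SHA-256 fingerprint on success,
    or a short error string on failure (timeout, reset, TLS error).
    """
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((ip, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=hostname) as tls:
                der = tls.getpeercert(binary_form=True)
                return True, hashlib.sha256(der).hexdigest()
    except OSError as exc:  # ssl.SSLError and timeouts are OSError subclasses
        return False, f"{type(exc).__name__}: {exc}"


def summarize(results):
    """Group results into {'ok': {fingerprint: [ips]}, 'failed': {ip: error}}."""
    ok = defaultdict(list)
    failed = {}
    for ip, (success, detail) in results.items():
        if success:
            ok[detail].append(ip)
        else:
            failed[ip] = detail
    return {"ok": dict(ok), "failed": failed}


def run(hostname):
    """Probe every resolved IP for hostname and print a short report."""
    results = {ip: probe(ip, hostname) for ip in resolve_ipv4(hostname)}
    report = summarize(results)
    for fingerprint, ips in report["ok"].items():
        print(f"OK   cert {fingerprint[:16]}... from {', '.join(ips)}")
    for ip, error in report["failed"].items():
        print(f"FAIL {ip}: {error}")
```

Calling run("drive.google.com") from a client behind the gateway prints one line per certificate fingerprint and one line per failing IP. A single run only covers the addresses DNS hands back at that moment, so repeating it over time gives a better picture of which IPs are affected.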
For those who don't know, with HTTPS Categorization (not Inspection), the gateways decide whether to permit or drop based on the CN in the certificate that is returned from the server. It does not (supposedly) rely on IP addresses. It doesn't see the FQDN being requested, as it's encrypted traffic.
After hours (and multiple shift changes) working with support, there was still no solution in sight. We rolled back APP/URL policies to a known good date, to no avail. We failed over, rebooted, and failed back over. We cleared out the APP/URL local cache on the gateways and set them to clear on policy installation. We ran debugs on APP/URL, rad, and so on.
We performed packet captures on both the working and non-working traffic. We pulled the cert (*.google.com) out of the captures and verified that it was identical in both the working and non-working cases. We, as well as CP support, were surprised by this; we thought for sure the cert would be different. If the CN is the deciding factor, why would an identical cert behave differently based on which IP address the FQDN was resolving to?
At this point we debugged wstlsd while still running curl/openssl tests. To our surprise, none of our test traffic was showing up in the debug. Finally, this tipped off our support rep and the magic command was issued:
fw tab -t cptls_server_cn_cache -x -y
"The cache saves mapping between IP+Port to CN (Certificate's Canonical Name) and a flag if the CN is valid. The table will go up to 10,000 entries and be cleared automatically to make room for new entries."
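To make that description concrete, here is a toy Python model of such a cache. To be clear, this is NOT Check Point's implementation: the class and function names are invented, the "drop the oldest entry when full" eviction is a guess, and the idea that a cached entry wins over the live certificate is only our inference from the behavior we saw. It just illustrates how one bad (IP, port) entry could make an identical cert behave differently per IP.

```python
from collections import OrderedDict

MAX_ENTRIES = 10_000  # size limit quoted by support


class ToyCnCache:
    """Toy (IP, port) -> (CN, valid) cache. Illustration only."""

    def __init__(self, max_entries=MAX_ENTRIES):
        self.max_entries = max_entries
        self.entries = OrderedDict()

    def store(self, ip, port, cn, valid):
        # ASSUMPTION: when full, evict the oldest entry to make room.
        key = (ip, port)
        if key not in self.entries and len(self.entries) >= self.max_entries:
            self.entries.popitem(last=False)
        self.entries[key] = (cn, valid)

    def lookup(self, ip, port):
        return self.entries.get((ip, port))

    def clear(self):
        # Loosely analogous to flushing the table by hand.
        self.entries.clear()


def categorize(cache, ip, port, cn_from_server, allowed_cns):
    """Return 'accept' or 'drop'.

    ASSUMPTION: a cached entry is trusted over the CN just seen on the wire.
    """
    cached = cache.lookup(ip, port)
    if cached is None:
        cached = (cn_from_server, True)
        cache.store(ip, port, *cached)
    cn, valid = cached
    return "accept" if valid and cn in allowed_cns else "drop"


allowed = {"*.google.com"}
cache = ToyCnCache()
# Simulate one bad entry: right CN, but flagged as not valid.
cache.store("192.0.2.10", 443, "*.google.com", False)

bad_ip = categorize(cache, "192.0.2.10", 443, "*.google.com", allowed)
good_ip = categorize(cache, "192.0.2.11", 443, "*.google.com", allowed)
cache.clear()
after_clear = categorize(cache, "192.0.2.10", 443, "*.google.com", allowed)
print(bad_ip, good_ip, after_clear)
```

In this model the same certificate is dropped on the IP with the bad entry, accepted on the other IP, and accepted everywhere once the cache is cleared, which matches the symptoms we were seeing.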
There's not much in the support portal regarding this cache. The only pertinent match was sk120775.
Eureka! After running this command we were fully functional on all the Google IPs we were testing. There had to be either:
1) A miscategorized association between IP+Port & CN
2) A corrupted entry for the above
Once we resolved this, it was apparent why the failover/reboot/failover did not fix anything: the connections and caches all stayed in sync and active with the invalid information.
After so many hours (11, I think?) chasing this down, we did not pursue an RCA on why the cache entries were victimizing us. We were just glad we found them and could go to bed.
Hopefully someone will run across this article and it saves their bacon (and 11 hours).
TL;DR: HTTPS Categorization doesn't use IP addresses directly for categorization purposes, but it sure does cache them.