First off, I'm a fan of CheckMates and have learnt a lot from the site over the past few months. One particular discussion I came across was about an issue similar to the one we are now having. The discussion at Bad Performance goes through various configuration steps, which I have worked through myself. Some of these helped, but we still have issues.
Before I raise a TAC call, I was wondering if anyone can advise on whether there are any other configuration tweaks I can look at, and also whether these appliances are suitable or not.
So, we have 2 * 23500 appliances with R80.10 VSX installed. There are 11 Virtual Systems (2 switches and 9 gateways). The gateways are split with 5 active on HA01 and 4 active on HA02. 64-bit mode is enabled on all VS'es and I have configured CPU affinity for all the VS'es as well. The VS in question (VS2) has CoreXL enabled with 8 instances. Affinity has been set as follows:
VS_2: CPU 3 4 5 6 23 24 25 26
VS_2 fwk: CPU 3 4 5 6 23 24 25 26
VS2 is active on HA01
This VS is now our default route to the Internet for all devices and users. Reports indicate there are approx 2500 users going through this VS, which should be about right, as we have approx 4000 staff in the organisation but they are not all in the office at the same time. The concurrent connections limit is set to a max of 50000, and fw ctl pstat indicates connections fluctuate between 15000 and 20000 during the day.
The VS has the App Control, URL Filtering, AV, AB, IPS and IA blades enabled. HTTPS inspection was enabled previously. While it was, the F2F/PXL split was approx. 60/40 and CPU time for the FWK2 process was between 200% and 300%. With HTTPS inspection disabled, the split is now 25/75 and CPU time is between 200% and 700%. Occasionally during the day, SmartDashboard loses connection to VS2 on HA01, and it sometimes gets so bad that we cannot push policy. Other VS'es are fine.
During the day, users start having issues where site access gets slow and then they are unable to access sites at all. The browser says the site is unreachable. If they wait 5-10 minutes, they can then access the site. This seems to coincide with very high CPU time (i.e. over 650%). During non-working hours, the CPU time is fine and performance is great. Sites load very quickly and there are no issues. The issues only appear during the day, a few hours after the working day starts.
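To put the figures above in proportion, here is a minimal Python sketch. It assumes the CPU time percentage is reported top-style, where 100% means one fully busy core, so the 8 cores assigned to VS_2 fwk give a ceiling of 800%. That interpretation, and the helper names `core_utilisation` and `conn_table_usage`, are my assumptions for illustration only and are not Check Point tooling.

```python
# Rough proportions from the figures quoted in this post.
# ASSUMPTION: "CPU time" is a top-style percentage (100% = one busy core).

ASSIGNED_CORES = 8      # VS_2 fwk affinity: CPU 3 4 5 6 23 24 25 26
CONN_LIMIT = 50_000     # configured concurrent connections limit


def core_utilisation(cpu_time_percent, cores=ASSIGNED_CORES):
    """Fraction of the assigned cores in use (hypothetical helper)."""
    return cpu_time_percent / (cores * 100.0)


def conn_table_usage(connections, limit=CONN_LIMIT):
    """Fraction of the connections limit in use (hypothetical helper)."""
    return connections / limit


if __name__ == "__main__":
    # CPU time values mentioned in the post
    for cpu in (200, 300, 650, 700):
        share = core_utilisation(cpu)
        print(f"CPU time {cpu}% = {share:.1%} of {ASSIGNED_CORES} cores")

    # Connection counts reported by fw ctl pstat during the day
    for conns in (15_000, 20_000):
        share = conn_table_usage(conns)
        print(f"{conns} connections = {share:.0%} of the {CONN_LIMIT} limit")
```

If that assumption holds, the 650% to 700% peaks would mean the 8 assigned cores are over 80% busy, while the connections table sits well under half of its limit. That would point at CPU rather than table size as the constraint, but it is only a back-of-envelope reading.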
As mentioned, I have gone through most of what is in Bart's discussion but need a sanity check.
For the number of users we have and blades enabled, is the 23500 with VSX suitable or not?
Why would disabling HTTPS inspection increase CPU time?
How do I find out what the FWK2 process is doing to generate such high CPU times?
What other troubleshooting can I do?
Apologies if this sounds long-winded, but if anyone can provide any assistance it would be greatly appreciated.