Re: VSX Cluster Performance Issues

Roy_Smith · ‎2018-12-06

Hi

First off, I'm a fan of CheckMates and have learnt a lot from the site over the past few months. One particular discussion I came across was about an issue similar to the one we are now having. The discussion at Bad Performance goes through various configuration steps, which I have gone through myself. Some of these helped but we still have issues.

Before I raise a TAC call, I was wondering if anyone can advise on whether there are any other configuration tweaks I can look at. Also, whether these appliances are suitable or not.

So, we have 2 * 23500 appliances with R80.10 VSX installed. There are 11 Virtual Systems (2 switches and 9 gateways); the gateways are split with 5 active on HA01 and 4 active on HA02. 64-bit mode is enabled on all VS'es and I have configured CPU affinity for all the VS'es as well. The VS in question (VS2) has CoreXL enabled with 8 instances. Affinity has been set as:

VS_2: CPU 3 4 5 6 23 24 25 26
VS_2 fwk: CPU 3 4 5 6 23 24 25 26

VS2 is active on HA01
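
For anyone wanting to sanity-check an affinity layout like the one above, here is a minimal Python sketch that parses lines in the same `<label>: CPU <ids>` form and reports cores shared between different VS'es. It assumes the text has exactly that form; the `VS_3` line in the sample is made up purely for illustration and is not from this cluster, and the function names are hypothetical.

```python
from itertools import combinations


def parse_affinity(text):
    """Map each label (e.g. 'VS_2 fwk') to its set of CPU core numbers."""
    result = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(": CPU ")
        if not sep:
            continue  # skip lines that are not in the "<label>: CPU <ids>" form
        result[label.strip()] = {int(tok) for tok in rest.split()}
    return result


def shared_cores(affinity):
    """Return {(vs_a, vs_b): shared cores} for different VS'es that overlap."""
    by_vs = {}
    for label, cores in affinity.items():
        vs = label.split()[0]  # 'VS_2 fwk' -> 'VS_2'
        by_vs.setdefault(vs, set()).update(cores)
    overlaps = {}
    for a, b in combinations(sorted(by_vs), 2):
        common = by_vs[a] & by_vs[b]
        if common:
            overlaps[(a, b)] = common
    return overlaps


# The first two lines are from the post; the VS_3 line is hypothetical.
sample = """VS_2: CPU 3 4 5 6 23 24 25 26
VS_2 fwk: CPU 3 4 5 6 23 24 25 26
VS_3: CPU 6 7"""

affinity = parse_affinity(sample)
print(shared_cores(affinity))
```

An empty result would mean no two VS'es in the pasted text share a core; any entry points at cores worth a second look on a busy VS.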

This VS is now our default route to the Internet for all devices and users. Reports indicate there are approx 2500 users going through this VS, which should be about right, as we have approx 4000 staff in the organisation but they are not all in the office at the same time. The concurrent connections limit is set to 50000 and fw ctl pstat shows connections fluctuating between 15000 and 20000 during the day.

The VS has App Control, URL Filtering, AV, AB, IPS and IA blades enabled. HTTPS inspection was enabled. When it was, the F2F/PXL split was approx. 60/40 and CPU time for the FWK2 process was between 200% and 300%. With HTTPS inspection disabled, the split is now 25/75 and CPU time is between 200% and 700%. Occasionally during the day, SmartDashboard loses connection to VS2 on HA01, and it sometimes gets so bad that we cannot push policy. Other VS'es are fine.

During the day, users start having issues where site access gets slow, and then they are unable to access sites at all. The browser says the site is unreachable. If they wait 5-10 minutes, they can then access the site. This seems to coincide with very high CPU time (i.e. over 650%). During non-working hours, the CPU time is fine and performance is great. Sites load very quickly and there are no issues. The issues only appear during the day, a few hours after the working day starts.
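
To put the CPU figures in context, here is a small Python sketch that converts a multi-core CPU percentage into a share of the cores assigned to the VS. It assumes the numbers are top-style, where 100% equals one fully busy core, and that the process is confined to the 8 cores listed in the affinity. Neither assumption is confirmed here, and the function name is hypothetical.

```python
def core_utilisation(cpu_percent, cores):
    """Share of the assigned cores in use, for a top-style figure (100% = one core)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return cpu_percent / (100.0 * cores)


# Figures quoted in the post, against the 8 cores assigned to VS2.
for pct in (200, 300, 650, 700):
    share = core_utilisation(pct, 8)
    print(f"{pct}% over 8 cores is {share:.1%} of the allocation")
```

Under those assumptions, the readings above 650% mean the VS is using most of its 8-core allocation at the times users report problems.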

As mentioned, I have gone through most of what is in Bart's discussion but need a sanity check.

For the number of users we have and blades enabled, is the 23500 with VSX suitable or not?

Why would disabling https inspection increase CPU time?

How do I find out what the FWK2 process is doing to generate such high CPU times?

What other troubleshooting can I do?

Apologies if this sounds long winded but if anyone can provide any assistance it would be greatly appreciated.

Many thanks

Roy

G_W_Albrecht · ‎2018-12-07

VSX is maybe the most complicated CP product, and knowledge and experience with it are worth a lot. So I would not expect too much help here. I would rather raise an SR# with TAC asap.

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

Daniel_Taney · ‎2018-12-07

Have you gone through your IPS profile and looked for signatures with High or Critical Performance Impact? I'd cherry pick through those and see if there are any you can disable.

As far as the CPU loads go... is that Internet facing VS the highest utilized VS? Since you are using VSLS, could you try shifting some of the other VS workloads over to the other GW?

R80 CCSA / CCSE

Roy_Smith · ‎2018-12-08

Daniel

All the high performance impact protections are already set to inactive. We copied the Optimised profile and applied it to our policy. We have customised it with our own UserCheck pages. Prior to sending all users through this VS, we cleared the staging, which set those in Detect mode to Prevent, except for a few that we know are required for our users. Those protections originally set as inactive are still set as inactive.

Would Prevent use any more resources than Detect?

This VS is the most heavily hit. All the others are not using much CPU or memory. They generally stay very low but occasionally may get to 60-80%. The other VS'es have 2 cores assigned and several share the cores, as they are very small indeed.

Timothy_Hall · ‎2018-12-07

No idea why disabling HTTPS Inspection would increase CPU time, unless the HTTPS Inspection is somehow being handled somewhere where you aren't measuring CPU load.

I'd suggest looking at your APCL/URLF & Threat Prevention policy layers for optimization. There is a good chance that large amounts of traffic are getting pulled into PXL and expending large amounts of CPU while there.

If the slowdowns are happening often enough and lasting a while, you can take a shortcut to help identify the cause of the slowdown. I'm pretty sure these commands will work in VSX as long as you select the correct context first. When the slowdowns start:

ips off

Wait 60 seconds then check performance, if it is suddenly much better you need to tune IPS

fw amw unload

Wait 60 seconds then check performance, if it is suddenly much better you need to tune your TP policy

ips on

fw amw fetch local
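
The sequence above can be written down as a small checklist generator, so the order and the restore steps are not forgotten in the middle of an incident. This is only a Python sketch: it prints the steps for a person to type by hand in the correct VS context and does not execute anything on the gateway. The `Step` class and `checklist` function are hypothetical names; the commands and 60-second waits are taken from this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    command: str       # command to type in the VS context
    wait_seconds: int  # how long to wait before judging the effect
    if_better: str     # conclusion if performance improves noticeably


# Disable one feature at a time, in the order given in the post.
ISOLATION_STEPS = [
    Step("ips off", 60, "tune IPS"),
    Step("fw amw unload", 60, "tune the Threat Prevention policy"),
]
# Commands that put both features back afterwards.
RESTORE_COMMANDS = ["ips on", "fw amw fetch local"]


def checklist(steps=ISOLATION_STEPS, restore=RESTORE_COMMANDS):
    """Return the troubleshooting steps as printable lines, restore steps last."""
    lines = []
    for n, step in enumerate(steps, start=1):
        lines.append(
            f"{n}. run '{step.command}', wait {step.wait_seconds}s; "
            f"if much better: {step.if_better}"
        )
    for cmd in restore:
        lines.append(f"restore: run '{cmd}'")
    return lines


print("\n".join(checklist()))
```

Keeping the restore commands in the same list as the isolation steps is deliberate: the gateway should not be left with IPS or Threat Prevention unloaded once the test is over.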

If these commands didn't dramatically improve performance, you need to look at your APCL/URLF policy next. Are you using "Any" instead of "Internet" in the Destination of any APCL/URLF rules? Big no-no. Do you have a cleanup rule at the end of the APCL/URLF policy like "Any Internet Any"? You don't need it assuming the implicit cleanup rule's action is the default of Accept.

Unfortunately there is no way to turn off APCL/URLF on the fly like IPS/TP; the only way to do that is to uncheck the product checkboxes on the firewall/cluster object and reinstall policy. So that might be your next thing to try.

--
Second Edition of my "Max Power" Firewall Book
Now Available at http://www.maxpowerfirewalls.com

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

Roy_Smith · ‎2018-12-08

Timothy

I will certainly try those commands when we have the issues. We are using a unified policy with shared layers. One thing I'm trying to find out is how much extra load, if any, shared layers would place on the gateway. I would guess not much but I would like to get it confirmed.

Our policy uses either "Internet" or a specific destination. There may be 1 or 2 rules with "Any", but my next step is to look at the policy, tidy it up and probably strip it back, to see if it improves.

Timothy_Hall · ‎2018-12-09

As far as IPS, protections in Detect mode use more CPU than those in Prevent mode.

Inline/Shared/Unified policy layers don't usually make a huge difference as far as gateway performance when compared to ordered layers. However when utilizing inline layers it is recommended to specify only simple services (http, https, ftp, etc.) in the top-level parent rules, then invoke applications/categories in the sub-layers if possible. This helps maximize the effect of Column-based matching which can result in sizable rulebase lookup performance gains.

SIC error 148 is just a simple timeout, so that would just seem to indicate very high CPU load.


PhoneBoy · ‎2018-12-07

Are you running in 32bit VS mode or 64bit?

With a larger VS, you could probably benefit from increasing the amount of memory a VS can use (which is limited to 4GB in 32bit mode).

Roy_Smith · ‎2018-12-08

Everything is running in 64-bit. This was annoying when we first tested the cluster, as it kept flapping, but once we ran vs_bits 64 it was good after that. Memory is fine, as just under 50% of physical and virtual memory is being used.

Roy_Smith · ‎2018-12-08

Thanks guys for the replies. This issue is getting a bit hairy, as we have over 2000 users complaining about internet performance. One thing I have noticed is that one member, HA01, which is the active member for this problematic VS, is having difficulty getting out to do the Anti-Bot updates. The AV updates are fine and the other member is fine. I also see the same message

"Error: Update failed. Contract entitlement check failed. Could not reach "updates.checkpoint.com". Check proxy configuration on the gateway."

appear on this VS and 3 other VS'es. The strange thing is that the other 3 VS'es are active on HA02, but it is still HA01 that complains about the update. All other VS'es active on HA01 are fine. I had assumed through the week this was just down to the apparent resource issues I've been seeing. However, the update issue still occurs over the weekend, when 99% of staff are not working. I've gone through the usual SK articles with no success. It may be some other tweak I made, so I'm trying to backtrack through what I have changed.

One other thing I noticed is that when the VS was struggling, we would see "sic error (148)" when trying to push policy, which made it very difficult to apply any changes. I'm going through all these trying issues, as I believe they are symptoms of the same problem, but I'm struggling to work it out.

TusharPatel · ‎2019-05-01

Roy,

Just wondering, how were the performance issues resolved for you? As Timothy commented, IPS in Detect mode consumes more CPU than Prevent mode. Did you make those changes?
