Marcelo_Fontana
Explorer

Concurrent Connections vs. High CPU

Hello everyone,

A quick question: I have a 23500 cluster with the concurrent connections parameter set to 500k.

The question is: if during a certain period of the day the connections exceed this value, for example reaching 600k or 700k from 2 p.m. to 4 p.m., is it normal behavior for the firewall to pin the CPUs at 100%?

I know that I can leave the setting on automatic or increase it as needed.

I want to understand the relationship between the connection limit being exceeded and the high CPU consumption during this period.

It's a behavior I've been observing: normally this number is around 200k, but at times it exceeds 500k, and that's when I notice the high CPU consumption.

7 Replies
the_rock
Champion

Normally, you should leave that setting as "automatic", as I believe that lets the gateway calculate the connection limit based on memory, CPU utilization, and capacity. Now, if that number fluctuates significantly, I would try to find out why that's happening. Is there load on a specific interface? Where are those connections coming from?
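For reference, a quick way to check the live size and configured limit of the connections table from expert mode (a sketch; the exact output format varies by version):

```shell
# Current entries (#VALS) and peak (#PEAK) in the connections table
fw tab -t connections -s

# The configured limit appears in the table header line ("... limit N ...")
fw tab -t connections | head -1
```

Comparing #PEAK against the limit over a few days shows how close the gateway actually gets during the spikes.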

Marcelo_Fontana
Explorer

We know which hosts are causing the connection spikes. The question is whether reaching the limit would cause high CPU consumption.

Bob_Zimmerman
Advisor

That depends entirely on what the traffic is doing. Basic firewall rule matching is extremely efficient, so additional connections don't cost much processor time. A 23500 is two E5-2660v3 chips. I'd be surprised if 400k new connections per second caused more than ~10% processor usage across all CoreXL worker cores if you have SecureXL enabled.

In contrast, deep inspection for IPS can be extremely expensive. Even 200 high-volume connections with performance-intensive IPS protections enabled could absolutely bring that box to its knees.

Marcelo_Fontana
Explorer

Yes, SecureXL is enabled, as is IPS.

Even after disabling IPS, the problem continued.

This firewall is very stable; it's only at these peaks that I see this problem. When concurrent connections exceed 500,000 I see the CPU increase, and even with IPS disabled the problem continues.

Only after concurrent connections drop below 500,000 does processing improve.

I can change this parameter to automatic without any problem; I just wanted to understand the relationship between this 500k limit and the high CPU.

Bob_Zimmerman
Advisor

Having connections around costs memory, but nothing else. Only processing traffic on those connections costs processor time.

Do you have aggressive aging enabled? If so, what are the thresholds? It's configured under IPS, but I don't think it counts as an IPS protection when you disable IPS. I don't think I've ever seen a firewall with aggressive aging hit the configured connection limit, though.

If you have high processor usage when the limit is hit, even with IPS disabled, I suspect you have some clients which are retrying extremely rapidly. Can you tell what the firewall thinks is consuming the processor time? cpview can log processor and interface data over time, and I think that's enabled by default on newer versions. That may let you look back in time at what was happening during the high processor load.
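If that historical logging is indeed enabled, the recorded samples can be browsed on the gateway (a sketch; availability and navigation keys vary by version):

```shell
# Open cpview in history mode to browse past samples of CPU, memory,
# and interface counters collected by the cpview daemon
cpview -t
```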

_Val_
Admin

What do you see when running top? What exactly is taking CPU time?
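For a quick, non-interactive look, standard Linux top in batch mode works from expert mode (generic Linux usage; seeing per-worker fw threads in the thread view is an assumption about your version):

```shell
# One-shot snapshot of the busiest processes
top -b -n 1 | head -15

# Include individual threads (e.g. firewall worker threads) in the listing
top -b -H -n 1 | head -15
```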

Timothy_Hall
Champion

In my experience, when the connections table is overflowing, the gateway performs terribly: some connections work and others don't, and web pages can partially hang because separate connections for third-party content fail and portions of the page cannot be rendered.

This terrible performance may well be at least partially caused by high CPU utilization on the gateway. I don't specifically recall whether CPU is high when this overflow condition is present, but I can certainly see it happening due to the following:

1) The first packet of a new connection arrives.

2) It runs through the Firewall/Network Policy Layer in F2F for a somewhat expensive rulebase lookup and matches an Accept rule.

3) Since the connections table is full, the packet is dropped, and no SecureXL Accept template can be formed because the traffic never left through the egress side of SecureXL.

4) A retransmission arrives, goes through the same expensive rulebase lookup in F2F, and is dropped yet again.

Depending on how far over the limit you are when the connections table is full, I could see a situation where thousands of new packets per second are causing massive numbers of rulebase lookups with no opportunity to be efficiently matched by a SecureXL template. Keep in mind that usually the vast majority of packets crossing the firewall are associated with existing connections, and are matched and permitted with a quick state table lookup that has very low overhead. I could see these thousands of overflowing connections retransmitting and retrying over and over again, causing a very large number of rulebase lookups, which would manifest as high CPU load on the firewall workers/instances but not the SND/IRQ cores.
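The retransmission argument above can be sketched as a toy cost model. The cost units and the 100:1 ratio between a full F2F rulebase lookup and a state table lookup are purely illustrative assumptions, not measured Check Point figures:

```python
# Toy model: relative CPU cost of packets matched in the state table vs.
# packets forced through a full F2F rulebase lookup because the
# connections table is full. Cost units are arbitrary and illustrative.
STATE_LOOKUP_COST = 1       # established/templated connection: cheap hash lookup
RULEBASE_LOOKUP_COST = 100  # full rulebase pass in F2F (assumed ratio)

def cpu_cost(established_pps: int, overflow_retry_pps: int) -> int:
    """Total cost per second: established traffic is cheap, while every
    retransmission from an overflowing connection repeats the expensive
    rulebase lookup (no SecureXL template can be formed for it)."""
    return (established_pps * STATE_LOOKUP_COST
            + overflow_retry_pps * RULEBASE_LOOKUP_COST)

# 500k established packets/s vs. the same plus 10k retrying overflow packets/s
normal = cpu_cost(500_000, 0)
overflow = cpu_cost(500_000, 10_000)
print(overflow / normal)  # a 2% packet overflow triples the modeled load
```

The point of the model: even a small fraction of un-templatable, constantly retrying packets can dominate worker-core CPU, because each one pays the rulebase-lookup price on every retransmission.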

Next time this happens, determine your CoreXL split and look at the CPU utilization on just the firewall workers/instances. If they are high but the SND/IRQ cores are not nearly as busy, I can pretty much guarantee that the overhead is firewall rulebase lookups, which happen only on the worker cores in F2F.
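The checks described above map to a few standard gateway commands (a sketch from expert mode; the command names are standard Check Point CLI, but output details vary by version):

```shell
# Which cores are SND/IRQ vs. CoreXL firewall workers
fw ctl affinity -l -r

# Per-worker-instance connection counts and load
fw ctl multik stat

# SecureXL summary: accelerated vs. F2F packet counters
fwaccel stats -s
```

A high F2F percentage in the SecureXL stats alongside busy worker cores and comparatively idle SND/IRQ cores would match the rulebase-lookup explanation.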

"Max Capture: Know Your Packets" Self-Guided Video Series
available at http://www.maxpowerfirewalls.com