Re: Question about bandwidth usage

Kryten · ‎2022-05-25

Hey there,

I got a question from one of our customers which I cannot really answer yet and I hope to get some help here again 🙂

The customer is complaining about low throughput from the clients to the Internet. Most of the problems seem to be related to a poorly optimized Threat Prevention policy and high utilization of the appliance in general (high CPU load and memory usage), but there is one thing we cannot explain yet:

When the customer is testing the throughput via HTTP/HTTPS downloads from the Internet he gets about 20-30 Mbit/s for a single connection, which is not really much. When he opens ten of those connections simultaneously, he gets 200-300 Mbit/s in total.

According to the customer the involved interfaces have been checked and they are not reaching any limit.

Now the Question:
Why is one connection not using as much throughput as it can? There is no QoS used anywhere, so in my understanding a single connection should still use all the available bandwidth it can get. Is there a limit on how much throughput one deeply inspected connection can have in total?

Hope to get some insight into this.

Cheers,

Alex

Chris_Atkinson · ‎2022-05-25

Are you using a recent JHF on R80.30 / R80.40?

PRJ-32072,
STRM-737

Security Gateway

UPDATE: Check Point Active Streaming (CPAS) TCP Window scale factor is now increased up to 6.
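For context on why the window scale factor can matter for a single connection: a TCP connection can have at most one receive window of data in flight per round trip, so its throughput is bounded by window / RTT. Below is a minimal Python sketch of that generic bound. The 20 ms RTT is a hypothetical value picked for illustration, not a measurement from this thread, and the function names are made up for the example.

```python
def max_window_bytes(scale_factor: int) -> int:
    """Largest receive window that can be advertised with a given
    TCP window scale shift (16-bit window field shifted left)."""
    if not 0 <= scale_factor <= 14:
        raise ValueError("TCP window scale shift must be between 0 and 14")
    return 65535 << scale_factor


def throughput_ceiling_mbit(scale_factor: int, rtt_ms: float) -> float:
    """Upper bound on one connection's throughput in Mbit/s:
    one full window per round trip (window / RTT)."""
    if rtt_ms <= 0:
        raise ValueError("RTT must be positive")
    window_bits = max_window_bytes(scale_factor) * 8
    return window_bits / (rtt_ms / 1000.0) / 1_000_000


if __name__ == "__main__":
    ASSUMED_RTT_MS = 20.0  # hypothetical RTT, not taken from the thread
    for scale in range(0, 7):
        ceiling = throughput_ceiling_mbit(scale, ASSUMED_RTT_MS)
        print(f"scale factor {scale}: window {max_window_bytes(scale)} bytes, "
              f"ceiling {ceiling:.1f} Mbit/s at {ASSUMED_RTT_MS} ms RTT")
```

This only describes the generic TCP limit. Whether it is the bottleneck here depends on the real RTT and on the window the gateway actually advertises for actively streamed connections.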

CCSM R77/R80/ELITE

Kryten · ‎2022-05-25

It's R80.40 with JHF Take 139.

According to the release notes, this was added after Take 139, so it might be a possibility.
The traffic in question is inspected by HTTPS Inspection and IPS (signatures with high performance impact are enabled).

G_W_Albrecht · ‎2022-05-25

This works as expected: single connections are restricted in speed, as usually more than one connection is open. So more connections will add up to a higher throughput.

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

Kryten · ‎2022-05-30

Thanks for confirming! I am struggling to find anything about this though (everything I find points to QoS, which we don't use here), could you point me somewhere where this is explained?

G_W_Albrecht · ‎2022-05-30

I have had that issue pop up in a customer's SR# some time ago, when R&D explained that a single process is (was?) not allowed to use all the throughput it can get away with 🙂 Elephant flows can also be seen as part of this discussion.


Kryten · ‎2022-05-30

I was always under the impression that a single connection can use as many resources as are left (be it CPU time or bandwidth), as long as there is no QoS active (we don't even have the blade active here).
I see that multiple concurrent connections would probably be spread out over the available CPU cores, so the total processing power would be more than what can be available for one connection. And with IPS and HTTPS Inspection in use (and a lot of other stuff going on on the GW), this is probably the first limit that is hit (and not the throughput of the ISP line).

What you describe sounds more like a hard limit configured somewhere though (which would explain the relatively constant numbers the customer gets with these tests), and if that is the case I would like to know about it and maybe even a way to change it 🙂

G_W_Albrecht · ‎2022-05-30

Yes, if TP scans the connection's data stream, I think this will happen on one fw_worker.

Maybe this is a good starting point -


Wolfgang · ‎2022-05-30

How about the hardware of your appliance, how many CPUs are in use?

You should really get more throughput than 30 Mbit/s for a single connection. Running the appliance close to its memory limit or swapping to disk will decrease the overall performance. You can try disabling Threat Prevention or URLF/APPCL to narrow down the problematic feature.

Do you use the gateway as a proxy?

It's no problem to get 200 Mbit/s for a single connection with full HTTPS Inspection and the default Optimized Threat Prevention profile on a 3800 appliance. The working CPU core goes to nearly 100% in this case, but only one of the 8 cores.

Kryten · ‎2022-05-31

It's an HA cluster of 6500 appliances, but they have a lot of blades active and a rather unoptimized TP policy, which leads to high memory and CPU usage. I now have the task of sorting that out and seeing what we can optimize to make things better.

I guess a lot of these problems come from this shortage of CPU and memory, but I have a hard time understanding why it's not possible to get the same throughput for a single connection that we get with multiple concurrent ones from the same host to the same target.

To make things worse, this is a production setup in the medical sector, so most of the time it is not possible to just turn off or change things for troubleshooting.

Timothy_Hall · ‎2022-05-31

Because you are on R80.40, you'll need to do some manual tuning. Please provide the output of enabled_blades and the Super Seven commands (S7PAC - Super Seven Performance Assessment Commands). Some general tuning based on the outputs of these commands could be very helpful, especially if these connections are winding up in F2F/slowpath. Most of the tuning happens automatically now in R81+.

As mentioned earlier in this thread if you have a single elephant/heavy connection it can only be handled on one worker core (PSL or CPAS), although Hyperflow is planned to address this in R81.20. Detection & mitigation strategies for elephant flows are here:

sk164215: How to Detect and Handle Heavy Connections

sk122013: Handling heavy connections in CoreXL
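To make the one-worker point concrete, here is a rough Python sketch of the reasoning. It is a toy model, not how CoreXL actually schedules traffic: it assumes every flow is pinned to exactly one worker core and that flows spread perfectly evenly. The per-worker inspection capacity, the spare-capacity fraction, and the worker count are all hypothetical numbers, not values from this gateway.

```python
def single_flow_ceiling(per_worker_mbit: float, spare_fraction: float = 1.0) -> float:
    """Ceiling for one flow: it can only use the spare capacity
    of the single worker core it is pinned to."""
    if not 0.0 <= spare_fraction <= 1.0:
        raise ValueError("spare_fraction must be between 0 and 1")
    return per_worker_mbit * spare_fraction


def aggregate_ceiling(n_flows: int, n_workers: int,
                      per_worker_mbit: float, spare_fraction: float = 1.0) -> float:
    """Ceiling for n_flows parallel flows, assuming an ideal even spread:
    at most min(n_flows, n_workers) workers contribute their spare capacity."""
    if n_flows < 0 or n_workers < 1:
        raise ValueError("need n_flows >= 0 and n_workers >= 1")
    workers_used = min(n_flows, n_workers)
    return workers_used * single_flow_ceiling(per_worker_mbit, spare_fraction)


if __name__ == "__main__":
    # All three values below are hypothetical, for illustration only.
    WORKERS = 8
    PER_WORKER_MBIT = 100.0
    SPARE = 0.25
    for flows in (1, 4, 10):
        total = aggregate_ceiling(flows, WORKERS, PER_WORKER_MBIT, SPARE)
        print(f"{flows} flow(s): at most {total:.0f} Mbit/s in total")
```

In this model, adding parallel connections raises the total until every worker is busy, while the ceiling for any single connection stays the same. That is the general shape of the behaviour described above, not a prediction of the actual numbers.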

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

Kryten · ‎2022-06-01

I guess most of those commands I have already used at some point while working through the MaxPower Book 🙂

I have attached the S7 output here, thanks a lot for having a look!

enabled_blades:
fw vpn cvpn urlf av appi ips identityServer SSL_INSPECT anti_bot ThreatEmulation content_awareness Scrub

The RX-DRP rates have already been checked and they come from unknown protocol drops (RRCP broadcasts from a Switch).

And then we have a seriously screwed-up TP policy which I have to clean up soon. The first rule has "Scope: any" and only IPS activated. Later rules (with the other TP blades active) never match that way, it seems. Also, IPS protections have never been managed or checked, so we have close to 12000 in Staging/Detect mode and only about 650 in Prevent. All this while checking for protections up to a high performance impact... happy times 😄

Timothy_Hall · ‎2022-06-01

Your F2F is a tad high at 29%; if those 30 Mbit/s connections are getting pulled into F2F, that could explain the bottleneck you are seeing. In addition, your worker cores are all fairly busy but well balanced at only about 25% idle; there is not a lot of headroom available if an elephant flow shows up, which could also explain the bottleneck.

I'd sort the IPS protections by Performance Impact, look at any that are Critical, and disable them if you can. Also, under Threat Tools...Protections, highlight each one and check the Performance Impact rating on the Summary tab; try not to have any Criticals enabled if possible. That should help a lot and give you some more headroom on the workers by dropping the F2F percentage. You can confirm that it is TP causing the issue, and get a preview of how much there is to be gained, by temporarily disabling TP as mentioned on pages 350-351 of the third edition of my book. It is always good to confirm that you are looking in the right place before trying to tune things.

Also make sure in any Access Control layers that are doing APCL/URLF and/or Content Awareness that the Destination is always Internet and never Any to keep traffic from getting inappropriately pulled into the Medium Path, but you are probably already doing that based on the path percentages you posted.

Since you have HTTPS Inspection enabled, watch out for traffic getting pulled into active streaming when it doesn't need to be. See my post here about HTTPS Inspection Policy Optimization which can make a big difference on R80.40 and earlier: HTTPS Inspection Policy Rule Order


Kryten · ‎2022-06-02

Big thanks for the advice, a few things here to be considered that I was not aware of.

For IPS, we already only use protections with a performance impact of High or lower, so none with Critical impact. As far as I understand it, this still puts around 50% of the traffic in the F2F path.
Also, I am pretty sure that there are at least a few occurrences of "any" left in some APCL/URLF rules, as I saw some warnings regarding this in the policy install history. That's something I will try to clean up as well.

What I also saw is that quite some load is generated during the day by httpd processes working for the UserCheck portal, which is used a lot here. So my next task will be to find out if that could be moved to an external server.

I am not sure when we will have sorted these things out, as this depends on the customer, but I will update this Thread again once we are done and let you know how much of a difference these things make. Thanks again!
