RamGuy239
Advisor

Check Point CPAP-SG3800 and expected performance levels

Greetings,

I'm trying to squeeze as much performance as possible out of a Check Point CPAP-SG3800. It's currently running R81, but we might move it to R81.10 if there are performance benefits to be had.

The issue at hand is that the customer wants two LACP bonds with three members in each bond. This is not for redundancy; it's strictly about performance and load sharing.

This firewall is running the Firewall blade only, and they are going to have a lot of 4K video streams passing through it. They have based their expectations on the 3800 datasheet:

https://www.checkpoint.com/downloads/products/3800-security-gateway-datasheet.pdf

It claims 3.6 Gbps throughput and 2.75 Gbps IP-sec VPN throughput (AES-128).

 

The design is not ideal from the get-go. Going for 3x 1 Gbps LACP to achieve 3 Gbps is not the best way of handling things; I would much rather have 10 Gbps interfaces. But it is what it is, I suppose. The bonding is running this configuration:

add bonding group 1
add bonding group 2
add bonding group 1 interface eth1
add bonding group 1 interface eth2
add bonding group 1 interface eth3
add bonding group 2 interface Mgmt
add bonding group 2 interface eth4
add bonding group 2 interface eth5
set bonding group 1 mode 8023AD
set bonding group 1 lacp-rate fast
set bonding group 1 mii-interval 100
set bonding group 1 down-delay 200
set bonding group 1 up-delay 200
set bonding group 1 xmit-hash-policy layer3+4
set bonding group 2 mode 8023AD
set bonding group 2 lacp-rate fast
set bonding group 2 mii-interval 100
set bonding group 2 down-delay 200
set bonding group 2 up-delay 200
set bonding group 2 xmit-hash-policy layer3+4
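
To sanity-check that both bonds actually negotiated 802.3AD with the intended hash policy, something like this works (a sketch only; the /proc output is from expert mode, the show commands from clish, and the group numbers match the config above):

cat /proc/net/bonding/bond1
cat /proc/net/bonding/bond2
show bonding group 1
show bonding group 2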

The customer also wants Jumbo Frames, so every interface that is part of bond2 is running MTU 9216. Bond1, which only contains the Internet link, is running MTU 1500. Currently no clients or servers on the site are sending Jumbo Frames, so in practice all current traffic is MTU 1500.
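
To make sure the jumbo MTU actually holds end to end on bond2 before production traffic arrives, a quick don't-fragment ping from expert mode can be used (a sketch only; the destination address is a placeholder, and 8972 bytes corresponds to a 9000-byte packet minus the 28-byte IP/ICMP header, so adjust upwards if the whole path really carries 9216):

ping -M do -s 8972 10.10.20.50    # placeholder host on a bond2 VLAN; must not report fragmentation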

The 3x 1 Gbps LACP bond that makes up the Internet link connects to a 10 Gbps ISP circuit, and the ISP guarantees 10 Gbps within the customer WAN between their main locations. So IP-sec VPN traffic going from one main site to another is supposed to have a 10 Gbps path throughout the whole chain, with support for MTU 9216.


Internal VLAN-to-VLAN traffic is almost reaching 3 Gbps throughput and is working better than expected, all things considered. But traffic over IP-sec VPN is hitting a brick wall at around 500-650 Mbps. This is where our real issues begin.

I know the datasheet is some kind of "best-case scenario", but the customer expects it to be higher than this. What makes it even worse is that latency on the entire firewall quadruples once these 4K video streams pass over IP-sec VPN, so even with almost idle SNDs and fw workers under low load, all traffic is affected.

 

The VPN community is running:

Phase 1:

IKEv1
AES-256
SHA-256
DH Group 19

Phase 2:

AES-GCM-128
NO PFS


CoreXL Dynamic Split / Dynamic Balancing is enabled and the appliance is running in USFW mode. When looking at the load during these IP-sec sessions we can see the number of SNDs increasing from 2 to 4, so the appliance normally runs 6+2 but adjusts itself to 4+4 once this heavy IP-sec traffic hits. When looking at multi-queue, all interfaces seem to stay at 2 queues the whole time, even after the number of SNDs increases under load.
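
For reference, the split and multi-queue state described above can be checked with the standard tools (a sketch, assuming the usual R81 commands):

fw ctl affinity -l -r    # which CPUs are assigned to interfaces, SNDs and fw workers
fw ctl multik stat       # CoreXL instances and their connection/peak load
mq_mng --show            # multi-queue: active queue count per interface and driver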

The problem is that this traffic seems to be locked to a single thread/CPU. Even when going from 2 SNDs to 4 SNDs there is always one SND stuck at 100% while the others barely see any load. The VPN processing is supposed to be able to scale over multiple threads, but it seems like we are hitting some kind of limit here?

When looking at top we can see that it's ksoftirqd/0 that is utilising 100% of a CPU.
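
The SoftIRQ distribution is easy to check from expert mode (observation only; nothing here changes the configuration):

top -bH -n 1 | grep ksoftirqd        # per-thread CPU usage of the softirq kernel threads
grep -E "CPU|NET_RX" /proc/softirqs  # how NET_RX softirqs are spread across the cores
cat /proc/interrupts | grep -i eth   # which cores take the NIC interrupts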

 

I'm not entirely sure whether we can expect this to improve by tweaking things. My thought was to disable the dynamic CoreXL split, statically set it to 2 fw workers + 6 SNDs, and manually override multi-queue to 6 queues on all interfaces, in the hope that this traffic will scale. But considering it doesn't seem to scale with the current 4+4, I suspect this won't really change anything.

The customer wants Jumbo Frames enabled on the IP-sec VPN. I suppose this might improve performance, as the ISP has Jumbo Frames enabled, so in theory we should be able to run MTU 9216 for the VPN traffic. But this would require a re-design, as they also have IP-sec VPN with third parties. I suppose we would have to utilise a VTI with MTU 9216 set on the VTI interface, while also having MTU 9216 on the Internet link; the other VPN communities could then keep using domain-based VPN with a normal MTU. Seems like a tiresome process, and if we are limited by hardware I'm not sure it will do anything.
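
Something like the following is what I have in mind for the route-based part, purely as a sketch: the tunnel addresses and peer name are placeholders, the VTI still has to be matched by a route-based (empty encryption domain) setup on the management side, and none of this has been tested here:

add vpn tunnel 1 type numbered local 169.254.10.1 remote 169.254.10.2 peer REMOTE-MAIN-SITE
set interface vpnt1 state on
set interface vpnt1 mtu 9216
set interface bond1 mtu 9216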

We could also lower the settings on the VPN community, but throughput is mostly affected by the phase 2 settings, and it's already running AES-128-GCM, which is supposedly the best-performing choice when accelerated with AES-NI.
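
Just to rule out the obvious, AES-NI exposure on the appliance can be confirmed from expert mode (a trivial check; every core should list the aes CPU flag):

grep -c aes /proc/cpuinfo    # should equal the number of cores if AES-NI is exposed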



I might try to flip it from USFW to KMFW. Does anyone have any pointers here? It's the numbers from the datasheet that are putting pressure on us, and I have a hard time understanding how we would achieve anywhere close to 2.75 Gbps via IP-sec VPN on this hardware if the traffic is stuck on a single-threaded process. These are Intel Atom cores, after all.


The gateway on the other end of the tunnel is an open server with an 8-core licence; it's configured with 4+4 and the interfaces have 4 queues via multi-queue. Its Intel Xeon CPU and AES-NI acceleration do not seem to be any kind of bottleneck. When checking cpview and top while the heavy VPN traffic hits, nothing points at this gateway being a limiting factor, at least not at our current 500-650 Mbps throughput. The NIC is on the HCL and uses the recommended Intel driver known to work well with Check Point Gaia. This one is also running R81.


Does anyone have any pointers here?

Certifications: CCSA, CCSE, CCSM, CCSM ELITE, CCTA, CCTE, CCVS, CCME
Wolfgang
Authority

@RamGuy239 

Correct me if I'm wrong… You have only one IPsec tunnel with another remote gateway, and all of your streaming traffic will be running through this tunnel?

If yes, you're actually fine with your current throughput. You can reach the advertised VPN throughput, but not when running only one IPsec tunnel. Another point: you need more than one connection carrying this streaming traffic, meaning multiple sources and multiple destinations. For a single connection there will always be a limitation.
If you want higher throughput you have to change your VPN design. This problem cannot be solved with a better appliance, more interfaces via LACP, AES-NI, or changing the IPsec parameters, etc.

Have a look at my older discussion: max performance / throughput of site2site-VPN

RamGuy239
Advisor

Hi, @Wolfgang 

Thank you for your feedback. This is a single Site-2-Site IP-sec VPN tunnel, and these 4K streams are going to run via this tunnel.

This environment is not in full production yet, so thus far the performance has been measured using 30 parallel sessions in iperf. The src and dst IP are the same for all 30 sessions, but each session does create a separate connection. This is obviously not equivalent to how the 4K streams are going to work in production, but shouldn't the different sessions each be treated as their own connection? Like I said, the LACP aggregation works just fine when doing the test from one local VLAN to another.
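
For reference, the test looks roughly like this (a sketch with placeholder addresses; iperf3 syntax shown, and classic iperf accepts -P the same way):

iperf3 -s                            # on a host behind the remote gateway
iperf3 -c 10.20.30.40 -P 30 -t 60    # 30 parallel TCP streams from a host behind the 3800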

Certifications: CCSA, CCSE, CCSM, CCSM ELITE, CCTA, CCTE, CCVS, CCME
Timothy_Hall
Champion

OK definitely a lot to unpack here, thanks for all the detail.

1) Because you have only the Firewall blade enabled, all VPN operations are happening on the SND/IRQ cores.  As such converting from USFW to KMFW won't have any effect as that transition only affects worker/instance cores which aren't the bottleneck.  SND is already in the kernel.

2) Your VPN algorithms are set optimally, not much to do there.

3) As you have observed, one of your SND cores is getting saturated, which sounds like the bottleneck you are currently hitting.  I know there was a bug a while back where all Remote Access VPNs would get improperly concentrated on a single SND core (this reared its head at the start of COVID when everyone was working from home), but that was fixed long ago and I don't think it applies to site-to-site VPN traffic (sk165853: High CPU usage on one CPU core when the number of Remote Access users is high).  Check out the following which might be helpful in your situation, although I can't figure out by reading it if this fix will apply to your fully-accelerated VPN traffic or not since it references the Medium Path: sk175890: SND connection outbound distribution issue when running VPN

4) Please provide the output of cat /proc/cpuinfo taken on the 3800.  If SMT/Hyperthreading is supported and enabled, this is one of those rare situations where turning it off will help, possibly quite a bit.  If the 3800 has 4 physical cores, with SMT enabled that brings us up to 8 CPU threads.  In this case one physical processor (threads 0 & 4) is handling both SNDs and the two SND instances are stepping on each other to reach the busy single physical core.  Disabling SMT would put you back in a 1/3 split initially and there would be a fully dedicated physical core to handle SND duties and it should be able to go faster; Dynamic Split should eventually settle you into a 2/2 split under heavy load or you could statically set that split if desired.

5) Please provide output of netstat -ni to see if your bottleneck is a NIC card/driver; there are probably some RX-DRPs occurring if your SNDs are saturating.

6) Even if properly implemented, I don't think Jumbo Frames will buy you a lot here; I'd investigate the above items and consider Jumbo Frames as the final step.  If you have inconsistent MTU sizes in the network path, trying to do Jumbo Frames will really kill VPN performance.

 

 

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
RamGuy239
Advisor

Hi, @Timothy_Hall.

I will provide more detailed data when I'm back at work tomorrow. The CPAP-SG3800 features an 8-core Intel Atom CPU; there is no SMT/Hyper-Threading support on Intel Atom. They are "real" cores, but it's Atom, so it's a rather old architecture with limited per-core performance. It's this model:

https://www.intel.com/content/www/us/en/products/sku/97926/intel-atom-processor-c3758-16m-cache-up-t...

 

When watching netstat -ni there is no doubt that once the VPN traffic starts and the SND / ksoftirqd/0 gets hammered, the RX-DRP counters increase by a lot.
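
A simple loop is enough to watch the counters climb during a test run (observation only, expert mode):

while true; do date; netstat -ni | grep -E "Iface|eth|bond|Mgmt"; sleep 2; done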

Certifications: CCSA, CCSE, CCSM, CCSM ELITE, CCTA, CCTE, CCVS, CCME
Timothy_Hall
Champion

Yeah if I remember correctly Atom is roughly 2-3 times slower per-core than a Xeon due to its ultra low voltage architecture, which Intel tries to make up for by having more cores, which isn't really going to help you here even though it does support AES-NI.

Assuming sk175890 doesn't help I'm thinking you are kind of stuck here.  Since you are in a lab environment you could try these last resorts:

1) With your default 2/6 split disable handling of VPNs by SecureXL with the vpn accel off command.  Normally you would not want to do this, but this may allow VPN traffic handling to be spread across multiple worker cores (sk118097: MultiCore Support for IPsec VPN), even though the workers will handle it less efficiently than an SND.  Might help, might hurt, might be a wash.  You'll have to try it.

2) With SecureXL handling of VPNs enabled (vpn accel on), set a static CoreXL extreme split of 6/2 and hope that Multi-Queue can spread the VPN traffic across the 6 SND cores.  However depending on the type of NICs in the 3800 they may only support 2 or 4 queues which will prevent their traffic from getting spread across all 6 SND cores, but it is worth a try.
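
Option 1 boils down to something like this in the lab (just a sketch; the vpn accel command is the one from sk118097, and the stat commands are purely for observation):

vpn accel off        # stop SecureXL from handling the IPsec traffic
fwaccel stat         # confirm SecureXL status and accelerated features
fw ctl multik stat   # see whether the fw workers now pick up the VPN load
vpn accel on         # revert once the test is done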

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
RamGuy239
Advisor

@Timothy_Hall 

Those are some nifty and creative suggestions. It's very easy to test with vpn accel off. Wouldn't vpn accel off also disable the use of AES-NI? Or is AES-NI engaged outside of SecureXL?

I've had a meeting with the customer. Their expectations have been lowered, but they still want to achieve 1 Gbps on the IPsec VPN connection, so I have to keep tweaking in the hope of getting those additional ~300 Mbps of throughput over IPsec VPN.

02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
05:00.0 Ethernet controller: Intel Corporation Ethernet Connection X553 1GbE (rev 11)
05:00.1 Ethernet controller: Intel Corporation Ethernet Connection X553 1GbE (rev 11)
06:00.0 Ethernet controller: Intel Corporation Ethernet Connection X553 1GbE (rev 11)
06:00.1 Ethernet controller: Intel Corporation Ethernet Connection X553 1GbE (rev 11)


Sadly, when it comes to Multi-Queue this becomes rather messy on the 3000-series. Check Point uses a combination of a dedicated Intel X553-T4 network card and the on-board I211 from the Intel Atom chipset. So this is a mix of igb and ixgbe drivers: the I211 uses igb 5.3.5.20, the X553 uses ixgbe 5.3.7 (V1.0.1_ckp). The former only supports 2 queues, the latter supports 16 queues (limited by the driver; Intel claims the hardware supports 64).
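
Driver and queue limits per port are easy to confirm from expert mode (a sketch; interface-to-chip mapping as described above):

ethtool -i Mgmt    # expect the igb driver on the I211 ports (Mgmt, eth5)
ethtool -i eth1    # expect the ixgbe driver on the X553 ports
ethtool -l eth1    # pre-set and currently active channel (queue) counts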

Mgmt and eth5 are I211, and both are part of bond2. I suppose this should be switched around: since all local interfaces and VLANs are on bond2, and bond1 is a flat subnet containing only the external network, it would make more sense to have the limited interfaces in bond1 rather than bond2. It also makes the number of queues supported by the NICs uneven within the same LACP bond.

I'm not sure where the SND / ksoftirqd/0 fits into this equation. The traffic flow will be src --> VLAN on bond2 --> fw --> bond1 --> ISP. When IPsec VPN is what's causing the SND to hit a brick wall, would we want the X553 ports that can use more than 2 queues to be in bond1 or bond2?

Certifications: CCSA, CCSE, CCSM, CCSM ELITE, CCTA, CCTE, CCVS, CCME
RamGuy239
Advisor

When it comes to sk118097 - MultiCore Support for IPsec VPN, it seems like each Site-to-Site IP-sec VPN gets linked to its own CoreXL FW instance. Shouldn't that mean a per-subnet-pair, or even per-host, setting should be able to balance this in some way? You get a new SPI per IPsec SA, so having an IPsec SA per subnet pair or per host should result in quite a lot of SAs with different SPIs that can be linked to different workers?

The tunnel is using large subnets, so I suppose we should make sure supernetting is disabled and break the encryption domain into smaller subnets so that additional SPIs get created.

But it also says that IPsec VPN MultiCore is linked to CoreXL FW instances. That should mean fw_workers and not SNDs, which makes it sound like this wouldn't make much of a difference, as the issue is the SND getting overloaded; the fw_workers are barely seeing any CPU in the first place. I'm not sure how Site-to-Site IPsec is handled in the kernel on R81+. Why would the SND see a much higher load when the traffic leaves over IPsec VPN compared to traversing locally from VLAN to VLAN? For the SND this should be pretty much the same if the IPsec part of the equation is handled by fw_workers and not SNDs?

Certifications: CCSA, CCSE, CCSM, CCSM ELITE, CCTA, CCTE, CCVS, CCME
RamGuy239
Advisor

sk175890 - SND connection outbound distribution issue when running VPN looks very promising. But I've tried to run "fw ctl get int cphwd_medium_path_qid_by_cpu_id" on R81.10, R81 and R80.40, on both virtual (VMware ESXi) gateways and appliances. They all return:

Get operation failed: failed to get parameter cphwd_medium_path_qid_by_cpu_id


I tried to ignore it and run:
echo "cphwd_medium_path_qid_by_cpu_id=1" >> $FWDIR/boot/modules/fwkern.conf

But it still returns:
Get operation failed: failed to get parameter cphwd_medium_path_qid_by_cpu_id


So I highly doubt that it made any difference. The SK claims to be relevant for pretty much any kind of installation from R80.20 through R81.10, so this seems a bit strange.

fw ctl set int cphwd_medium_path_qid_by_cpu_id 1
Set operation failed: failed to get parameter cphwd_medium_path_qid_by_cpu_id

Certifications: CCSA, CCSE, CCSM, CCSM ELITE, CCTA, CCTE, CCVS, CCME
G_W_Albrecht
Legend

sk43387 tells us that the system is functioning as designed. It gives another syntax for R80.20 and higher:

fw ctl get int <Name of Kernel Parameter> -a

But the result is the same. I have given feedback on sk175890 asking why this does not work...

CCSE CCTE CCSM SMB Specialist
RamGuy239
Advisor

It's a new feature that is being rolled out with upcoming Jumbo Hotfix releases; it seems the SK got published prematurely. I'm going to move the gateways to R81.10 + JHF Take 30 and receive a hotfix to make the feature available, so I can test it and see if it makes any difference.

Another thing regarding the use of 802.3AD (LACP) in layer 3+4 mode for aggregated performance: I suspect it's common for this to work internally, but not for Site-to-Site IP-sec VPN traffic. When doing iperf internally with 15-30 connections, the test uses the same destination IP on another VLAN, but each connection uses a different port.

This results in great aggregated performance, almost 3 Gbps throughput across the 3x 1 Gbps ports in the LACP bond. From what I understand, layer 3+4 hashes on source/destination IP and port, so the internal testing is able to get connections balanced across all three ports in bond2.

This won't be the case with IP-sec VPN traffic leaving bond1. All traffic will be dst: peer1, protocol: ESP (which carries no L4 ports), so all the IP-sec traffic gets the same hash and is limited to a single port in the LACP bond. I suppose the traffic going the other way will have a different hash, as it will be based on dst: peer2, protocol: ESP, and might land on another port in the bond, but we won't have any control over any of this. In other words, we won't be able to get above 1 Gbps for IP-sec VPN traffic heading to a specific peer, no matter what we do, as a result of the network design?

The only way to get around this would be to replace the CPAP-SG3800 with a CPAP-SG6xxx with 10Gbps interface?
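
One way to confirm this during a test would be to watch the per-member TX counters on bond1 while the VPN traffic runs; since ESP has no L4 ports, one bond member should end up doing nearly all of the TX work. A sketch from expert mode, with the member names from the config earlier in the thread:

for i in eth1 eth2 eth3; do echo "== $i =="; ip -s link show $i | grep -A1 "TX:"; done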

Certifications: CCSA, CCSE, CCSM, CCSM ELITE, CCTA, CCTE, CCVS, CCME
_Val_
Admin

This all depends on what part of the VPN functionality you are talking about.

Within an established tunnel, VPN encryption/decryption is usually done by SecureXL, i.e. SNDs. 

Timothy_Hall
Champion

AES-NI can be invoked both by SecureXL and the firewall workers.

Yes, definitely avoid those I211 NICs for any interface that needs to carry a large amount of bandwidth, due to the queue limitations.  This was called out in the third edition of my book, p. 192.  Try to keep all VPN traffic flows on the X553s as much as possible; I'd avoid mixing and matching X553 and I211 in the same bond if possible, to the point of reducing the number of physical interfaces in both bond1 and bond2 if necessary, and use the I211s just for management traffic.
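
If you go that route, the change itself is trivial in clish; a minimal sketch using the bond and interface names from earlier in the thread (Mgmt and eth5 being the I211 ports), obviously something for a maintenance window:

delete bonding group 2 interface Mgmt
delete bonding group 2 interface eth5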

To help spread the VPN traffic across the workers I'd set tunnel sharing to one tunnel per pair of subnets, and if it isn't balanced enough, try one tunnel per pair of hosts, although that can result in a lot of IPsec Phase 2 tunnels.  CoreXL FW Instance and what I call the firewall worker are the same thing.

The SND would see a higher load when it is handling VPN operations because it is having to handle SoftIRQ from the NICs, SecureXL processing and also VPN processing.  Assuming the VPN processing is causing most of the overhead, transitioning that to the workers should help significantly lighten the load on the SNDs and keep frames from getting lost due to RX-DRPs under heavy load.

Looks like the cphwd_medium_path_qid_by_cpu_id variable is only available in the very latest ongoing Jumbo HFAs (such as the one for R80.30) and is not widely available in the other Jumbo HFAs yet.

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
RamGuy239
Advisor

Nice catch. We are running R81 with ongoing JHF 51, but that does not seem to include it yet. Strange to publish the SK without mentioning the JHF requirements. We are planning to move to R81.10, but we will hold off to see whether this feature arrives for R81 or R81.10 first so we can test it.

Certifications: CCSA, CCSE, CCSM, CCSM ELITE, CCTA, CCTE, CCVS, CCME
Timothy_Hall
Champion

You could also try contacting TAC, they may have an individual hotfix you can apply to add this functionality that hasn't been rolled up into the Jumbo HFA yet.

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
RamGuy239
Advisor

I've already done so.

I've also noticed this in the R81.20 EA release notes:

Major performance and stability improvement for Remote Access and Site to Site VPN that delivers a much higher capacity for VPN tunnels.

But this could mean anything, I suppose. "Capacity" could mean that a gateway is capable of handling additional IP-sec communities, with nothing to do with throughput. But the wording makes it sound like R81.20 might provide something great for this scenario, given that it uses terms like "major performance" and "much higher capacity".

Certifications: CCSA, CCSE, CCSM, CCSM ELITE, CCTA, CCTE, CCVS, CCME
