mstraub
Participant
Jump to solution

VPN performance limits

Hi All,

We are struggling with optimizing VPN performance in 10GBit/s and higher environments.

Typical VPN-only gateway setups seem to utilize 16 SND cores (due to the 16 rx queue limit for ixgbe) while all other cores stay mostly idle. So for high end appliances, with >40 cores, most of their CPU power cannot be used.

As an example, one of our 23800 appliances running R80.40 will overload 16 cores at about 8000 remote users with 5GBit/s of real world traffic (I know the 23800 is dated, but this is not the point).

So my question to the community is: Am I missing something and is there a way to more evenly distribute the load to all cores in a VPN gateway?

Best regards

Matthias

14 Replies
Chris_Atkinson
Employee

Is dynamic balancing enabled on this system?

Depending on the enabled blades & encryption algorithms at play, this is one application where you might consider disabling SMT.

But I'm sure others such as @Timothy_Hall will also have an opinion on this.

HeikoAnkenbrand
Champion

Hi @mstraub,

1) When SecureXL is enabled, Encrypt-Decrypt actions usually take place at the SecureXL level (on CPU cores running as CoreXL SND). All VPN traffic will be handled on the CPU cores running as CoreXL SND under the following conditions:

  • Only "Firewall" and "IPSec VPN" software blades are enabled
  • There are no fragmented packets
  • SecureXL acceleration is not disabled by any of the security rules (refer to sk32578)
  • VPN features that are disqualified from SecureXL (see below) are disabled

2) Choose an encryption algorithm that is AES-NI compatible (AES-128, AES-256)

3) Optimize the interface affinity and enable multi queuing

4) Check that you have no or low fragmented packets rate
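
A quick way to sanity-check points 1) and 4) from the gateway CLI is to look at the SecureXL acceleration counters (fwaccel stat, fwaccel stats -s). The sketch below just parses a sample summary in that style; the numbers are made up for illustration, not taken from a real gateway:

```shell
# On the gateway itself you would run:  fwaccel stat   and   fwaccel stats -s
# Here we parse a sample "fwaccel stats -s"-style summary (hypothetical
# numbers) to extract the accelerated-packets share. A value well below
# 100% means traffic is dropping out of SecureXL into the slow path.
sample='Accelerated conns/Total conns : 7800/8000 (97%)
Accelerated pkts/Total pkts   : 950000/1000000 (95%)
F2F pkts/Total pkts           : 50000/1000000 (5%)'

# Split each line on "(" and "%" so the percentage lands in field 2.
accel_pct=$(printf '%s\n' "$sample" | awk -F'[(%]' '/Accelerated pkts/ {print $2}')
echo "accelerated packets: ${accel_pct}%"
```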

---

The following SK describes the performance optimisation for VPN:
sk105119: Best Practices - VPN Performance

Here you will find many more performance tuning tips:
R81.x Architecture and Performance Tuning - Link Collection


➜ CCSM Elite, CCME, CCTE
mstraub
Participant

Thank you, Chris and Heiko,

I am aware of the performance SKs and also your write-ups, which are very helpful.

My challenge, though, is that I seem to have all optimizations enabled (including AES-NI, only FW and IPSec blades, no visitor mode, multi-queueing for all relevant interfaces, SecureXL for everything, no IP fragments).

So all VPN handling is done in the SND cores and there are plenty of them available - however only 16 SND cores process traffic and the others are idle. This seems to be by design due to the multi queue limit of 16 rx queues for 10G interfaces. Quite obviously all Internet traffic arrives at one 10G interface and is then handled by only up to 16 cores.

Now our customer is looking at his system monitoring and sees his firewalls hit the load limit while most CPUs are idle.

So I am looking for confirmation that this is by design and nothing can be done about it or, even better, a way to solve this. An obvious idea would be to increase the rx queues to 32 or similar. I am not sure if this can be done, though, or if there is a hardware limitation. Another idea would be to connect a second Internet interface; however, I would really like to avoid a major workaround like this.

Best regards

Matthias

Chris_Atkinson
Employee

From a multi-queue perspective it depends on the interface driver:

https://sc1.checkpoint.com/documents/R81/WebAdminGuides/EN/CP_R81_PerformanceTuning_AdminGuide/Topic...

To confirm, you've already explored the impacts of SMT/HT?

mstraub
Participant

Thank you, Chris

I read from this that 16 queues/cores is the limit. Unfortunately, the firewall also chooses the same cores for all interfaces.

To illustrate my question, please take a look at the output of "mq_mng --show".

Total 48 cores. Multiqueue 40 cores
i/f type state mode cores
------------------------------------------------------------------------------------------------
Sync igb Up Auto (2/2) 0,24
eth1-01 igb Up Auto (8/8) 0,24,1,25,2,26,3,27
eth2-01 ixgbe Up Auto (16/16) 0,24,1,25,2,26,3,27,4,28,5,29,6,30,7,31
eth2-02 ixgbe Up Auto (16/16) 0,24,1,25,2,26,3,27,4,28,5,29,6,30,7,31

There are 40 cores available for the two 10G plus the two 1G interfaces. However only 16 cores are used in total for all interfaces. Can this be changed?
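
The core lists in the output above can be tallied to confirm the ceiling; this little sketch simply counts the distinct CPUs across all four interfaces (the core lists are copied verbatim from the mq_mng output):

```shell
# Core lists from the "mq_mng --show" output above, one interface per line
# (Sync, eth1-01, eth2-01, eth2-02).
cores='0,24
0,24,1,25,2,26,3,27
0,24,1,25,2,26,3,27,4,28,5,29,6,30,7,31
0,24,1,25,2,26,3,27,4,28,5,29,6,30,7,31'

# Count how many distinct CPUs actually serve RX queues.
distinct=$(printf '%s\n' "$cores" | tr ',' '\n' | sort -n | uniq | wc -l)
echo "distinct SND cores in use: $((distinct)) of 40 available"
```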

Best regards

Matthias

PS: I am not sure if SMT/HT would affect this.

Chris_Atkinson
Employee

From memory a 23800 has 24x physical cores (SMT disabled), 48x virtual cores (SMT enabled).

Exercise caution, but disabling SMT could potentially give you better bang for buck with your SNDs, assuming your remaining cores are idle as you say.
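
For reference, SMT pairing can be read straight from sysfs on the gateway; the path in the comment below is standard Linux, while the sibling value used here is illustrative for a 24-core/48-thread box and not read from a live system:

```shell
# On the gateway itself:
#   cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
# With SMT on, logical CPU 0 typically pairs with CPU 24 on a 24-core box,
# which matches the 0,24 pairs visible in the multi-queue output.
siblings="0,24"   # illustrative value, not read from a live system
case "$siblings" in
  *,*) msg="SMT on: CPUs $siblings share one physical core" ;;
  *)   msg="SMT off: CPU $siblings has its physical core to itself" ;;
esac
echo "$msg"
```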

mstraub
Participant

@Chris_Atkinson wrote:

From memory a 23800 has 24x physical cores (SMT disabled), 48x virtual cores (SMT enabled).

Exercise caution but potentially disabling SMT would give you better bang for buck with your SNDs assuming your remaining cores were idle as you say.


Thanks, Chris,

Good point: reducing the number of cores by disabling HT could improve things for a 23800. I guess we will try this.

Best regards

Matthias

Timothy_Hall
Champion

You cannot go beyond 16 queues for an ixgbe-driven NIC; it is a driver limitation.  The only way to do so would be to acquire and install a new Intel NIC card utilizing the i40e driver which supports at least 48 queues (but beware that certain older NIC hardware has lower limits), or a Mellanox card (driver mlx5_core) which supports at least 60 queues.  Dynamic balancing/split will not really help due to this limit once there are 16 or more SNDs allocated, as it will always use the same 16 cores for all interface processing; I think this has to do with how the IRQs are affined.

However given that practically all the processing happens on the SNDs in this case I agree with the other posters that disabling SMT will help here, potentially a lot.  SMT benefits Firewall Workers/Instances a fair amount, but an SND operates much better with full access to a single physical core rather than 2 separate SNDs vying for the same physical core via SMT threads.

The only other way to utilize more cores would be to leave SMT enabled, disable VPN acceleration, then substantially reduce the number of SNDs, thus forcing all the VPN operations onto all of the worker cores where you could go well beyond 16.  However the VPN processing as implemented in the worker cores is substantially less efficient than the SNDs, and I don't know if you would actually gain all that much overall throughput with this approach even though you would be utilizing many more cores; doing so might even make things worse.
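
To put the queue limit in numbers, here is a back-of-the-envelope sketch using the ~5 Gbit/s figure from the original post, assuming the load spreads evenly across queues (a simplification, since real traffic is hashed per flow):

```shell
traffic_mbps=5000   # ~5 Gbit/s of real-world VPN traffic (from the OP)

# ixgbe ceiling: 16 queues, so at most 16 SND cores touch this traffic.
per16=$(( traffic_mbps / 16 ))
# i40e-class NIC: enough queues to give all 40 SND cores a share.
per40=$(( traffic_mbps / 40 ))

echo "per-core load at 16 queues: ${per16} Mbit/s"
echo "per-core load at 40 queues: ${per40} Mbit/s"
```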

Watch My 2023 CPX360 Speech Titled "Max Power
Reloaded: R81+ Gateway Performance Innovations"
mstraub
Participant

Thank you, Timothy

I think this sums it up very well.

Disabling HT will surely look better and hopefully give some extra performance.

Changing to a 40G interface would also increase the number of queues as well as cores used.

Tobias_Moritz
Advisor

@mstraub Just a small hint if not known: when sticking with Intel, you do not have to use an actual 40G interface card to go beyond the 16 hardware queue limit; the controller just has to be from the 40G family. The famous X710, for example, is available as a 10G interface card (even with 10GBase-T if you want), but it's a 10G/40G controller and uses the i40e driver.

Chris_Atkinson
Employee

Note this is a Check Point appliance not open server 🙂

the_rock
Legend

One thing I always do first when VPN performance is degraded is to run the vpn accel off command for that tunnel. That usually seems to make a difference.

Below is the reference:

https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut...
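
If memory serves, the per-peer toggle from that SK looks roughly like the following; treat the exact syntax as an assumption and verify it against the SK. Shown here as a dry-run wrapper so the commands are visible without a live gateway, and the peer address is made up:

```shell
peer="203.0.113.10"   # hypothetical peer gateway IP, not from the thread

# Dry run: print the commands instead of executing them on a gateway.
# Presumed sequence: check state, disable acceleration for the peer,
# re-enable it later once testing is done.
for cmd in "vpn accel stat" "vpn accel off $peer" "vpn accel on $peer"; do
  echo "would run: $cmd"
done
```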

jameson7
Explorer

using split-tunneling configurations and new infrastructure that supports up to 500K simultaneous connections. The new design uses Windows 10 VPN profiles to allow auto-on connections, delivering a seamless experience for our users.

Ref. https://www.microsoft.com/en-us/insidetrack/enhancing-remote-access-in-windows-10-with-an-automatic-...

_Val_
Admin

I believe the original question here was about Check Point remote access VPN performance limitations, not about Microsoft. Posting a link to an MS white paper is kinda pointless...
