Multi-Queue & “RX-DRP”
As mentioned in the last article, cores designated as Secure Network Distributors (SNDs) have several responsibilities beyond just processing fully-accelerated traffic in the SXL path. SNDs also try to keep the CPU load among the Firewall Workers evenly balanced via a function called the Dynamic Dispatcher. SNDs have one other critical function: transferring incoming Ethernet frames from the Network Interface Card (NIC) hardware into Gaia system memory for subsequent processing. This critical transfer function is called SoftIRQ (Soft Interrupt ReQuest).
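One way to see which core is doing the SoftIRQ work for a given interface is to look at the interrupt counters in /proc/interrupts. The sketch below parses a hypothetical sample rather than the live file so it can run anywhere; the interface names, IRQ numbers, and counts are illustrative assumptions, not output from a real firewall.

```shell
# Illustrative /proc/interrupts snippet (on a real Gaia firewall, read
# /proc/interrupts directly). Columns CPU0-CPU3 hold per-core counts.
cat <<'EOF' > /tmp/sample_interrupts
           CPU0       CPU1       CPU2       CPU3
  24:          0    9912345          0          0   PCI-MSI  eth1
  25:    4455667          0          0          0   PCI-MSI  eth2
EOF
# Find the CPU column with the largest count for eth1 -- that is the
# SND core servicing its SoftIRQ work in this sample.
busy_core=$(awk '/eth1/ { max=-1; core=0
                          for (i=2; i<=5; i++) if ($i+0 > max) { max=$i+0; core=i-2 }
                          print core }' /tmp/sample_interrupts)
echo "eth1 SoftIRQ interrupts are being serviced mainly by CPU ${busy_core}"
```

With only one SND core per interface (no Multi-Queue), all of an interface's counts pile up in a single CPU column, as eth1's do above.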
In our last article, we detailed a situation in which the CoreXL “split” required adjustment to add more SNDs, thus reducing the number of Firewall Workers. This change was made due to an imbalance of CPU load between the SNDs and Firewall Workers that was overloading the SNDs. Allocating more SNDs makes more cores available for the critical SoftIRQ function, but normally one (and only one) SND core can process Ethernet frames from a single network interface.
However, even when sufficient SNDs have been allocated on the firewall and they are not overloaded, Ethernet frames can arrive from a very busy NIC faster than a single SND core can process them via SoftIRQ. When this occurs, frames can be lost. This problematic situation is known as a “receive drop” (RX-DRP) or a “buffering miss”. The command netstat -ni will show the number of frames successfully received (RX-OK) on an interface, and the number of frames dropped (RX-DRP) because the SND could not process the incoming Ethernet frames fast enough via SoftIRQ.
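The RX-DRP counter is most meaningful as a percentage of total received frames. The sketch below computes that percentage from a canned sample of netstat -ni output; the interface names and counts are made up for illustration, and exact column positions can vary between net-tools versions, so adjust the awk field numbers if you pipe in live output.

```shell
# Illustrative netstat -ni output (on a real firewall, substitute:
#   netstat -ni | awk '...')
cat <<'EOF' > /tmp/sample_netstat
Iface  MTU   RX-OK      RX-ERR RX-DRP RX-OVR TX-OK     TX-ERR TX-DRP TX-OVR Flg
eth1   1500  1000000    0      2500   0      900000    0      0      0      BMRU
eth2   1500  5000000    0      0      0      4800000   0      0      0      BMRU
EOF
# Percentage of received frames dropped on eth1: RX-DRP / (RX-OK + RX-DRP)
drop_pct=$(awk '$1 == "eth1" { printf "%.2f", 100 * $5 / ($3 + $5) }' /tmp/sample_netstat)
echo "eth1 RX-DRP rate: ${drop_pct}%"
```

A sustained drop rate well above a small fraction of one percent on a busy interface is a hint that the servicing SND core cannot keep up, and that Multi-Queue (below) may help.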
To mitigate this frame loss situation, Multi-Queue permits more than one SND core to simultaneously process frames from a single network interface. On firewalls employing the Gaia 2.6.18 kernel the administrator must manually enable Multi-Queue on the busy interfaces that need it (for up to five interfaces at once), but on the Gaia 3.10 kernel Multi-Queue is enabled by default on all network interfaces except the management interface. In the real world Multi-Queue is usually necessary on 10Gbps+ interfaces carrying at least 4-5Gbps of traffic; Multi-Queue is normally not helpful on 1Gbps or slower interfaces. For more information on Multi-Queue see sk153373: Multi-Queue Management for Check Point R80.30 with Gaia 3.10 kernel.
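Conceptually, Multi-Queue works like NIC Receive Side Scaling: the NIC hashes each incoming flow onto one of several receive queues, and each queue is serviced by its own SND core, while all packets of a given flow stay on the same queue so they are not reordered. The toy mapping below is illustrative only; real NICs hash the full connection tuple rather than just a port number, and the port values here are invented.

```shell
# Toy sketch of flow-to-queue distribution with 4 receive queues.
# A real NIC uses a hash (e.g. Toeplitz) over src/dst IP and ports;
# here a simple modulo over the source port stands in for it.
QUEUES=4
mapping=$(for sport in 33001 33002 33003 33004; do
    echo "flow with source port $sport -> rx queue $((sport % QUEUES))"
done)
echo "$mapping"
```

Because every flow lands consistently on one queue, a single elephant flow still cannot be spread across multiple SND cores, which is one reason Multi-Queue alone does not solve the “heavy connection” problem discussed later.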
Simultaneous Multithreading (SMT) a.k.a. Hyperthreading
SMT is a feature introduced by Intel to help ensure more efficient use of the available cores. It is enabled on almost all Check Point appliances that support it, including nearly all firewall appliance models Check Point currently sells. When SMT is enabled, twice the number of physical cores actually present in the hardware are presented to the Gaia operating system for use. For example, if a system has eight physical cores, sixteen cores will appear to be available to Gaia when SMT is active. SMT accomplishes this by creating two separate “threads” of execution on each physical core; obviously the number of real physical cores has not changed. So essentially two Firewall Worker instances are sharing the same physical core.
On most real-world firewalls, SMT yields an approximately 30% boost in overall firewall performance. While such a hefty gain may at first seem impossible, keep in mind that while Firewall Worker instances are frequently very busy, unless they are 100% utilized (with 0% idle time) there are frequent periods of inactivity spent waiting for more packets to arrive for processing. When one of the two Firewall Worker instances sharing a core is waiting, the other instance can immediately commence processing. So essentially SMT helps ensure that a physical core spends as much time as possible actively inspecting network traffic instead of sitting idle.
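On a Linux-based system such as Gaia, one quick way to confirm SMT is active is to compare the "siblings" (logical threads per socket) and "cpu cores" (physical cores per socket) fields of /proc/cpuinfo: with SMT on, siblings is double cpu cores. The sketch parses sample values for a hypothetical eight-physical-core appliance rather than the live file, so it runs anywhere.

```shell
# Illustrative /proc/cpuinfo fields for one socket of an 8-core CPU
# with SMT enabled (on a real box, read /proc/cpuinfo itself).
cat <<'EOF' > /tmp/sample_cpuinfo
siblings   : 16
cpu cores  : 8
EOF
siblings=$(awk -F: '/siblings/  { gsub(/ /,""); print $2 }' /tmp/sample_cpuinfo)
cores=$(awk   -F: '/cpu cores/  { gsub(/ /,""); print $2 }' /tmp/sample_cpuinfo)
# With SMT active, logical cores = 2 x physical cores
if [ "$siblings" -eq $((2 * cores)) ]; then
    echo "SMT is active: ${cores} physical cores -> ${siblings} logical cores"
fi
```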
As mentioned earlier, SMT is enabled by default on all Check Point appliances that support it, and SMT should not be disabled in most environments. However, if more than 80% of the traffic crossing the firewall is fully accelerated by SecureXL (this is not very common in the real world!) it can be beneficial for overall firewall performance to disable SMT, as the SNDs handling the accelerated traffic do not benefit from SMT nearly as much as the Firewall Worker instances do. For more information about SMT see sk93000: SMT (HyperThreading) Feature Guide.
Fast Accelerator
As mentioned in the last article, it is desirable to have traffic passing through the firewall be processed in the most efficient path possible to save CPU overhead and decrease latency. The most efficient path is called SXL, or the “fastpath”, or the “Accelerated Path” or just “fully-accelerated”. Traffic handled in this path consumes the least amount of CPU time and introduces the lowest possible latency. But in the real world on most firewalls, not very much traffic is eligible for handling in the “fastpath”. In general the more blades and features are enabled on a firewall, the less traffic can be fully accelerated.
This situation can become problematic with so-called “elephant flows”; Check Point also calls these “heavy connections”. Elephant flows are single connections passing through the firewall at very high speed (1Gbps+) with a lot of data to send and receive. Classic elephant flows would be operations such as system backups and replications. While the firewall configuration can be tuned to attempt to handle these elephant flows in the most efficient path possible, getting them all the way into the most-efficient SXL path solely via tuning adjustments can be difficult. Enter the SecureXL “Fast Accelerator” (fast_accel) feature.
When fast_accel is enabled, traffic matching certain IP addresses and/or port numbers is forced into the fastpath regardless of what the security policy dictates, thus ensuring the most efficient handling of this heavy traffic. Note that to some degree fast_accel is “whitelisting” the matched traffic, and not all inspection operations called for in the security policy will necessarily be performed on it. As such it is NOT recommended to use the fast_accel feature with traffic flows going to or from untrusted systems (such as third-party servers on the Internet). Fast_accel is disabled by default on all Check Point firewalls and must be manually configured by an administrator. For more information: sk156672: SecureXL Fast Accelerator (fw fast_accel) for R80.20 and above.
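A typical fast_accel workflow, sketched under assumptions: the subnet, server address, and port below are hypothetical (an internal backup network pushing rsync traffic to a trusted server), and the command syntax follows sk156672 (source, destination, destination port, IP protocol number). The commands are printed rather than executed so the sketch is self-contained; on a real firewall you would run them directly.

```shell
# Hypothetical elephant-flow endpoints -- substitute your own trusted
# internal systems; protocol 6 = TCP, port 873 = rsync.
SRC="10.1.1.0/24"; DST="10.2.2.50"; DPORT="873"; PROTO="6"
add_cmd="fw ctl fast_accel add $SRC $DST $DPORT $PROTO"
# Enable the feature, add the rule, then verify the rule table.
printf '%s\n' "fw ctl fast_accel enable" "$add_cmd" "fw ctl fast_accel show_table"
```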
User-Space Firewall (USFW)
Due to a Gaia kernel memory limitation, no more than 40 cores can be utilized by CoreXL even if there are more than 40 cores available in the firewall hardware. Because the Firewall Workers are normally created in the kernel of the Gaia operating system as kernel drivers/modules, they are subject to this memory limitation.
USFW removes the 40-core limitation imposed by the kernel memory limit and allows creation of as many Firewall Worker instances as there are cores available. When USFW is enabled, Firewall Worker instances are created in the Gaia OS as processes instead of kernel modules/drivers. In most cases this is a distinction without a difference: a firewall with USFW enabled behaves more or less the same as one without it. The key differentiation of USFW is the ability to use more than 40 cores on the firewall, thus increasing performance.
Generally the state of USFW (either enabled or disabled) should not be changed from its default factory value. For exceptions to this and for more information about USFW consult the following SK: sk167052: Check Point User-Space firewall support for R80.30 3.10 and above.
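Two checks commonly used to see whether a firewall is running in user-space mode, sketched as printed commands so the block runs anywhere (on a real Gaia firewall you would run them directly): cpprod_util FWisUsermode is reported to return 1 when USFW is enabled, and with USFW active the Firewall Worker instances appear in the process list as fwk processes rather than kernel threads.

```shell
# Hedged sketch: commands to verify USFW state on a Gaia firewall.
check_cmd="cpprod_util FWisUsermode"   # expected to print 1 when USFW is on
list_cmd="ps -ef | grep fwk"           # user-space workers show as fwk* processes
printf '%s\n' "$check_cmd" "$list_cmd"
```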
About the author
The Performance Optimization Series is written for you by Timothy Hall.