HeikoAnkenbrand
Champion

Performance Tuning - Bufferbloat

Bufferbloat is a cause of high latency and jitter in packet-switched networks, created by excess buffering of RX packets. It can also cause packet delay variation (jitter) and reduce overall network throughput. When a firewall is configured to use excessively large buffers, even very high-speed networks can become practically unusable for many interactive applications like VoIP, audio streaming, and even ordinary web browsing.

An established rule of thumb for network equipment manufacturers was to provide buffers large enough to accommodate at least 250 ms of buffering for a stream of traffic passing through a device. For example, a router's Gigabit Ethernet interface would require a relatively large 32 MB buffer (1 Gbit/s x 0.25 s is roughly 31 MB). Such sizing can defeat the TCP congestion control algorithm: the buffers take some time to drain before congestion control resets and the TCP connection ramps back up to speed and fills the buffers again. Bufferbloat thus causes problems such as high and variable latency, and it chokes network bottlenecks for all other flows, because the buffer fills with the packets of one TCP stream and other packets are then dropped.

As far as I know, the only tuning option on Check Point is changing the buffer size.
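For context, the receive buffers in question on Gaia are usually the NIC ring buffers, which can be inspected and resized with ethtool. A minimal sketch (eth0 and the size 2048 are placeholders; check the reported maximum first):

# show current and maximum RX/TX ring sizes
ethtool -g eth0
# raise the RX ring, staying at or below the reported maximum
ethtool -G eth0 rx 2048

Larger rings absorb bursts at the cost of added queueing delay, which is exactly the bufferbloat trade-off described above.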

Modern kernel versions make it possible to change the queueing algorithms. A few modern high-end routers and firewalls have a feature called "Smart Queue Management" (SQM). When SQM is enabled and properly configured, a router or firewall can eliminate most bufferbloat problems.

There are many different versions of SQM, and some are more effective than others; they allow the queue to adapt better to the network traffic. All modern Linux distributions now ship with queueing algorithms such as fq_codel, sch_fq, fq_pie, and others.

Now my question:

1) Is it possible to change the buffer algorithm on a gateway?
2) If yes, which buffer algorithms are recommended for 10 Gbps / 100 Gbps networks?
3) If not, will this be supported in future versions (e.g. in R82 with newer Linux kernels)?

➜ CCSM Elite, CCME, CCTE
7 Replies
PhoneBoy
Admin

I suspect the answers will be in R82, given we will be on a newer RHEL.

Timothy_Hall
Champion

Yes, the queueing strategy can be changed on Gaia using a command such as tc qdisc add dev eth0 root fq_codel limit 2000 target 3ms interval 40ms noecn. However, doing so is almost certainly not supported.

A bit of a cautionary tale: when putting together my Gateway Performance Optimization Course, I was able to introduce latency, packet loss, and bandwidth policing via VMware Workstation. This is key in some of the early labs for seeing what these conditions actually look like when probing the network with tools like ping, tracepath, traceroute, and iperf3. The one thing VMware could not simulate was jitter, but after some research I realized the Gaia/Linux command tc could simulate jitter on an interface via something like this: tc qdisc add dev eth0 root handle 1: netem delay 10ms 100ms. However, as soon as I tried this command the R81.20 gateway HARD HUNG in VMware. No console, no SSH access, it was completely locked up; I had to power cycle the VM to regain control. My guess is that SecureXL/sim has a lot of tendrils inserted into the low-level Linux networking functions and my attempt ran afoul of that. So yes, the needed commands are there to enable fq_codel and it seems to work, but I'd strongly advise against it.

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
HeikoAnkenbrand
Champion

Thank you @Timothy_Hall for describing your experience.

A queue discipline (qdisc) is an algorithm that determines how packets are queued, scheduled, and dropped on an interface. There are several such algorithms.

R81.20 uses pfifo_fast by default:

# tc qdisc show dev eth0
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1

This is the default queue discipline for Linux. The acronym FIFO is a common computing term meaning "first in, first out": in the context of QoS, the packets that enter the queue first are the ones that leave first. pfifo_fast uses three distinct queues, known as bands (0, 1 and 2), and the priomap in the output above maps packet priorities onto those bands.

Note that this output is for dev eth0. You should see at least one pfifo_fast qdisc for each configured interface.

Everything is clear to me up to this point. It is interesting that from kernel 3.12 onwards (R81.20 still has a 3.10 kernel), more effective algorithms have been added to Linux.

In high-performance environments, I see that the current algorithm is not 100% optimal. As a result, layer 4 (TCP) is affected by longer queueing times, and in extreme cases we generate TCP retransmissions through the firewall. This can be counteracted with smaller buffers: they reduce the time packets spend in the first-in-first-out queue and therefore affect layer 4 (TCP) less.

However, it would be better to use more effective queueing methods. Unfortunately, this is not possible with kernel 3.10, and if we change the current algorithms, we lose support. So, from my point of view, there are no optimisation possibilities at the moment.

Let's wait and see what is possible with R82 and a newer Linux kernel.

➜ CCSM Elite, CCME, CCTE
dtaht
Participant

A couple of notes (co-author of fq_codel, cake, and multiple other latency-fighting algorithms in the Linux kernel, here).

Yes, you have to be careful when fiddling with netem, especially with old kernels, but a failure to get netem working right has nothing to do with your prior successful application of fq_codel. I have many rants about how to use netem properly elsewhere.

fq_codel entered the kernel as of about Linux 3.4, so it should be available. However, there was an issue with GRO + HTB + fq_codel that was not resolved until 3.12. But if GRO is not enabled and you are not using HTB, fq_codel as a native qdisc should work fine, and you can see it in operation by running tc -s qdisc show after subjecting your link to some load. If it is working you might see some backlogs, drops, or ecn_marks; reschedules show the fq part working.
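For example (a sketch, assuming an interface named eth0 and a kernel that ships sch_fq_codel), attaching it natively and checking the counters might look like:

# attach fq_codel as the root qdisc on one interface
tc qdisc replace dev eth0 root fq_codel
# inspect statistics after generating some load
tc -s qdisc show dev eth0

Non-zero drop, ecn_mark, or reschedule counters in the statistics show that the codel and fq parts are actually engaging.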

Most Ethernet devices today have multiple queues (the mq qdisc), and rather than applying fq_codel to the top-level interface as you did, the right way to apply it universally is to have it available in the kernel and configured via the appropriate sysctl, which mq will automatically pick up; see the sketch below. There is a race in some distributions, on boot, between enabling the interface and inserting fq_codel.ko as a module.
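For reference, the sysctl in question is net.core.default_qdisc (available from kernel 3.12 onwards), so the usual approach is:

# make fq_codel the system-wide default qdisc; mq picks it up per hardware queue
sysctl -w net.core.default_qdisc=fq_codel
# persist across reboots
echo "net.core.default_qdisc=fq_codel" >> /etc/sysctl.conf

Interfaces brought up afterwards then get fq_codel under each mq sub-queue without any per-interface tc commands.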

Most major Linux distributions, starting with OpenWrt in 2012, have switched to fq_codel over the past decade. It became the default in RHEL 8, for example. Ideally it is compiled into the kernel and made the default qdisc there.

This works at line rate, and with Ethernet pause frames. The fq-ing portion especially breaks up large packet trains and ensures that VoIP, videoconferencing, and other forms of smaller traffic observe no queue from the fatter flows.

The sqm-scripts can be applied to any distribution to shape the connection down below line rate and manage latency better in both the up and down directions. An example might be a cable connection configured for 110/22 Mbit by the ISP but with horrible bufferbloat (I have seen +500 ms!): shape it down to 100/20 Mbit and the connection will feel much smoother and handle a lot more users.

I had no idea anyone was still shipping 3.10!! I did a backport of sch_cake (a superset of fq_codel with many new features, including an integral shaper) for ubnt years ago; it is available out of tree on GitHub. That makes a lot of complex QoS rather easy, and you can also use it as a default qdisc per the method I outlined above. Configuring it is also supported by the sqm-scripts.

But cake shaping outbound is simpler, a one-liner: tc qdisc add dev eth0 root cake bandwidth 20Mbit ack-filter nat diffserv4

It is four lines to configure it to shape inbound (see the sketch below). I believe RHEL 9 ships cake also, but fq_codel remains the default there because it is faster. If you have the CPU, though, I highly recommend trying cake. Hope this helps!
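For the curious, the usual inbound pattern (roughly the four lines mentioned) redirects ingress traffic through an ifb device and shapes there. A sketch, where ifb0 and the 100Mbit rate are illustrative and the matchall filter needs a reasonably recent kernel:

# create and enable an intermediate functional block device
ip link add name ifb0 type ifb
ip link set ifb0 up
# redirect all ingress traffic from eth0 to ifb0
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: matchall action mirred egress redirect dev ifb0
# shape the redirected traffic with cake, slightly below the real downlink rate
tc qdisc add dev ifb0 root cake bandwidth 100Mbit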

HeikoAnkenbrand
Champion

Thanks @dtaht for the information.

The question is, what can we do in 10/40/100 Gbps networks at the moment, other than reducing or increasing the buffer?

I see time and again on high-performance firewalls that this has consequences in the higher layers (TCP), and we create a lot of TCP retransmissions as a result.

Let's wait for R82 ;-)

➜ CCSM Elite, CCME, CCTE
dtaht
Participant

We have cake scaling to about 10 Gbit/s per core in the libreqos.io project. The biggest bottleneck as you try to crack 10 Gbit is on the read path, not the write path, and eBPF and DPDK seem to be the best ways to improve that. I am painfully aware of how much work it is to also move firewalling to userspace.

dtaht
Participant

"tc qdisc add dev eth0 root fq_codel limit 2000 target 3ms interval 40ms no ecn" As noted in my longer post, you can make fq_codel the default with a sysctl. Also, the defaults in fq_codel are usually good enough for most internet traffic. 

The default limit is 10000 packets, which is good to well over 10 Gbit of traffic. It is better to use the memory limit option than the packet limit option, due to GRO and TSO bulking up packets; it defaults to 32 MB on most OSes, again good to well past 10 Gbit. The default memory limit in OpenWrt is 4 MB, which is getting to be a bit small for gigabit traffic and definitely too small for 10 Gbit.
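In tc syntax the byte-based knob is memory_limit (the packet-based one is limit); an illustrative sketch, with eth0 and the value as placeholders:

# cap fq_codel by bytes rather than packets; 32 MB suits 10Gbit-class links
tc qdisc replace dev eth0 root fq_codel memory_limit 32mb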

The interval should be set to around the 98th percentile of your observed RTT. Many countries' RTTs are much smaller than the default interval of 100 ms, but 100 ms has been shown to scale well to worldwide (280 ms) traffic.

The target should be set to 5-10% of the interval, and to more than the time one MTU takes to serialize at the egress rate of the interface. For example, at 1 Mbit, one full-size MTU takes about 13 ms! Raise the target in such cases, but leave the interval the same.

ecn on is the default for codel and works well for RFC 3168-enabled traffic, providing lossless congestion control. There is some churn going on, as a new IETF standard (L4S) does ECN slightly differently, and it is anyone's guess how quickly that will roll out, but in my view some ECN support is better than none. fq_codel support for L4S landed in Linux 6.1.

We tried very hard to make the defaults apply at internet scale. People perpetually increase the target to crazy values (35, for example) and then wonder why it takes forever to do anything. It is a target: by attempting to hit it, and rarely getting there, the algorithm smooths out traffic, and the interval gives you burst tolerance. At data-center (DC) scale, where all your traffic might take 5 ms to transit the whole local cloud and does not go to the internet, you can certainly adjust the target and interval to match; I run my DC with normal TCP (and ECN) at target 250us, interval 1ms. Wifi tends toward burstiness, and nobody has a really good answer for it; with a tuned AP and fq_codel running native on that AP (see the "Ending the Anomaly" paper), we presently get down to 8 ms and 80 ms.
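Putting those data-center numbers into tc syntax (eth0 is a placeholder; the values are the ones quoted above for a roughly 5 ms cloud):

# fq_codel tuned for intra-DC RTTs, with ECN enabled
tc qdisc replace dev eth0 root fq_codel target 250us interval 1ms ecn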
