Re: TCPdump active: performance increases and problem disappears

Martin_Seeger · ‎2020-10-01

Hi,

we have a very strange effect:

On a Check Point Appliance 23900 with VSX R80.30, we have trouble with a CIFS/SMBv2 (Server Message Block protocol) session through which multiple (concurrent?) transfers are being sent within a single TCP-445 connection.

While debugging, we noticed a strange effect: When we turn on tcpdump on the gateway within the VS environment on one interface (toward which the traffic is directed), two things happen:

Throughput increases by about 50%.
Our problem disappears.

What does tcpdump do on the gateway that could lead to this behavior? Knowing that would give us a lead on the root cause of the problem.

Yours, Martin

P.S. The problem is that the concurrent transfer of five files fails. First we see (in a tcpdump on the client) a dramatic increase of the TCP ACK RTT (up to 40s) and after some time, the transfer is aborted. This is not necessarily a firewall problem, but it is strange that the problem does not occur when we run a tcpdump while testing.

Updates:

SMBv2 is used
The effect does not depend on which interface tcpdump is run on
Multi-queuing is active on the affected system
All relevant interfaces are bonds.

John_Fleming · ‎2020-10-01

That is odd.. this doesn't explain anything but I wonder if you set the promisc flag on that interface if it has the same effect?

ip link set [interface] promisc on

Martin_Seeger · ‎2020-10-01

@John_Fleming This was my first thought, but the effect also occurs when tcpdump is run in non-promiscuous mode.

John_Fleming · ‎2020-10-01

That is really odd.. I wonder if there is an interrupt mitigation mode that is being turned off in the NIC driver once tcpdump gets involved.

Timothy_Hall · ‎2020-10-02

Hmm, when tcpdump is active there is suddenly a "registered receiver" for all traffic, not just certain EtherTypes, regardless of whether promiscuous mode is active. Traffic received on the NIC for which there is not a registered receiver will be discarded and RX-DRP incremented; I suspect that some of the LACP or other control frames are being discarded somehow and affecting the proper operation of the bond. Running tcpdump lets those control frames through and things start working as they should. Unfortunately the only indication something like this is happening is RX-DRP being incremented when tcpdump is not active. The discarded frames are not "errors" per se, so there is no indication of an issue in the ethtool -S statistics for the interface. This specific RX-DRP situation seems to happen much more often with the new Gaia 3.10 kernel for some reason.
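
One way to check the RX-DRP theory is to snapshot the receive-drop counters with and without tcpdump running and compare them. The following is only a minimal sketch: it assumes the counters are read from the standard Linux /proc/net/dev file (where the fourth receive column is "drop"), and the helper names rx_drops and drop_delta are made up for illustration.

```python
def rx_drops(proc_net_dev_text):
    """Return {interface: rx_drop_count} parsed from the text of /proc/net/dev."""
    drops = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in proc_net_dev_text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, counters = line.split(":", 1)
        fields = counters.split()
        # Receive columns: bytes packets errs drop fifo frame compressed multicast
        drops[name.strip()] = int(fields[3])
    return drops


def drop_delta(before, after):
    """Return how much each interface's drop counter grew between two snapshots."""
    return {ifc: after[ifc] - before[ifc] for ifc in before if ifc in after}


# Hypothetical snapshots (made-up numbers), e.g. taken a minute apart.
HEADER = (
    "Inter-|   Receive                         |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast|bytes packets\n"
)
BEFORE = HEADER + " eth3-01: 1000 10 0 5 0 0 0 0 2000 20 0 0 0 0 0 0\n"
AFTER = HEADER + " eth3-01: 9000 90 0 12 0 0 0 0 8000 80 0 0 0 0 0 0\n"

print(drop_delta(rx_drops(BEFORE), rx_drops(AFTER)))
```

On a live system the two snapshots would come from reading /proc/net/dev twice, once with tcpdump running and once without, over intervals of similar length and load. A counter that only grows while tcpdump is off would support the theory.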

Duane mentioned the cat /proc/net/bonding/bondX command which gives a very high level of detail about the bond's operation, might be worth comparing the output of this command when tcpdump is running vs. when it is not for clues.

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com

Martin_Seeger · ‎2020-10-02

Thanks, handed this over to the people in the trenches. I will report on the progress.

If we had trouble with the bond, I would expect more problems (not just one server). But who knows...

We are joking that the network is too fast for the application and it requires tcpdump to slow it down in order to work ;-).

Martin_Seeger · ‎2020-10-02

No change between tcpdump and no tcpdump on the bond (output from diff):

--- tcpdump-off    2020-10-02 16:11:44.000000000 +0200
+++ tcpdump-on    2020-10-02 16:14:12.000000000 +0200
@@ -1,31 +1,31 @@
+###

How did we do the tcpdump and bond-parameters:

[Expert@vsx***:15]# tcpdump -nni bond0.372 -v -w kka-dummy2.pcap host 10.**.**.**

 [Expert@vsx***:0]# cat /proc/net/bonding/bond0
 Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)

 Bonding Mode: IEEE 802.3ad Dynamic link aggregation
 Transmit Hash Policy: layer3+4 (1)
 MII Status: up
 MII Polling Interval (ms): 100
 Up Delay (ms): 200
 Down Delay (ms): 200

 802.3ad info
 LACP rate: slow
 Active Aggregator Info:
     Aggregator ID: 1
     Number of ports: 2
     Actor Key: 33
     Partner Key: 32979
     Partner Mac Address: 00:23:04:**:**:**

 Slave Interface: eth3-02
 MII Status: up
 Link Failure Count: 0
 Permanent HW addr: 00:1c:7f:**:**:**
 Aggregator ID: 1

 Slave Interface: eth3-01
 MII Status: up
 Link Failure Count: 0
 Permanent HW addr: 00:1c:7f:**:**:**
 Aggregator ID: 1

Duane_Toler · ‎2020-10-02

Wow, that's indeed puzzling. LACP info there looks good. I see hashing is layer3/layer4, so all CIFS traffic for that one user's session will be traversing just one of the two links. If another user makes a CIFS connection to the same server, that should go across the other link (make sure the two users don't both have even or odd numbered IPs; they'll need to be opposite each other, so 192.0.2.1 and 192.0.2.2).
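
To illustrate why a single SMB connection stays on one member link, here is a small sketch of the layer3+4 transmit hash as the bonding documentation for older kernels describes it: ((source port XOR dest port) XOR ((source IP XOR dest IP) AND 0xffff)) modulo slave count. This is an assumption based on that documentation; the driver actually shipped with Gaia may compute it differently, and the function name and the addresses and ports are made up.

```python
import ipaddress


def layer34_slave(src_ip, dst_ip, src_port, dst_port, slave_count):
    """Pick a slave index using the documented (older-kernel) layer3+4 formula."""
    ip_xor = int(ipaddress.IPv4Address(src_ip)) ^ int(ipaddress.IPv4Address(dst_ip))
    return ((src_port ^ dst_port) ^ (ip_xor & 0xFFFF)) % slave_count


# Every packet of one TCP connection carries the same 4-tuple,
# so it always hashes to the same slave.
first = layer34_slave("192.0.2.1", "192.0.2.100", 50000, 445, 2)
again = layer34_slave("192.0.2.1", "192.0.2.100", 50000, 445, 2)

# A second connection from the same client uses another source port
# and may therefore land on the other slave.
other = layer34_slave("192.0.2.1", "192.0.2.100", 50001, 445, 2)

print(first, again, other)
```

Under this formula the source port feeds the hash as well as the IP addresses, so with only two links the parity of both matters. The sketch also covers only what the firewall transmits; the switch applies its own load-balancing algorithm to the return direction.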

I presume spanning-tree on the switch is good for that VLAN on the portchannel interface, and both member ports are configured the same as the portchannel interface? Does the switch see the firewall on the portchannel? (NX-OS and Arista EOS: show lacp interface ethernet x/y, for each LACP member port). You can also check the switch's load-balancing algorithm (show etherchannel load-balance) to see how and where it will be transmitting frames back to the firewall.

Curious, tho, the Linux bonding driver documentation says layer3+4 hashing isn't entirely 802.3ad compliant (https://www.mjmwired.net/kernel/Documentation/networking/bonding.txt#890), and fragmented packets make the balance algorithm worse (the layer-4 port info is left out of the hash for fragments), which can cause out-of-order delivery.

Duane_Toler · ‎2020-10-01

With LACP involved, can you check your switchports to see if they are sending active LACPDUs? Those ports will use "channel-group X mode active" (versus passive or "on"). I had one customer with strange interface issues until the switch port config was adjusted. The Linux bonding driver doesn't send active LACPDUs (by default, at least), so the switch port should be configured to do the work.

You can check the LACPDU values with tcpdump on the physical interface:

tcpdump -nni eth# ether proto 0x8809 -xXvv -s 0

Read the LACPDUs "Partner information" section and see if the system ID field is all zeros like this:

Partner Information TLV (0x02), length: 20
System 00:00:00:00:00:00, System Priority 65535, Key 1, Port 1, Port Priority 255
State Flags [Activity]
0x0000: 0000 0000 0000 0001 00ff 0001 0100 0000
0x0010: 0310

(this is the output during my customer's problem). The Linux bonding driver will report the same (cat /proc/net/bonding/bond0). When the switch port is changed, you'll see some value in the System ID section.
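
The same check can be scripted against the bonding driver's status file. This is a rough sketch that assumes the file contains a "Partner Mac Address:" line like the /proc/net/bonding/bond0 output posted in this thread; the helper names are invented, and the sample text uses made-up MAC addresses.

```python
import re

ALL_ZERO_MAC = "00:00:00:00:00:00"


def partner_mac(bonding_text):
    """Return the partner MAC from /proc/net/bonding/<bond> text, or None if absent."""
    match = re.search(r"^\s*Partner Mac Address:\s*(\S+)", bonding_text, re.MULTILINE)
    return match.group(1) if match else None


def partner_missing(bonding_text):
    """True if no LACP partner was learned (line absent or all-zero MAC)."""
    mac = partner_mac(bonding_text)
    return mac is None or mac == ALL_ZERO_MAC


# Hypothetical excerpts: a healthy aggregator and one with no partner info.
HEALTHY = """802.3ad info
LACP rate: slow
Active Aggregator Info:
    Aggregator ID: 1
    Partner Mac Address: 00:23:04:aa:bb:cc
"""
BROKEN = HEALTHY.replace("00:23:04:aa:bb:cc", ALL_ZERO_MAC)

print(partner_missing(HEALTHY), partner_missing(BROKEN))
```

An all-zero partner MAC would correspond to the all-zero System field in the LACPDU dump above. Treating a missing line as "no partner" is a choice made for this sketch, not documented driver behavior.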

You could also experience other "weirdness" with LACP if the two ends don't exchange info in the LACPDU. The hashing algorithm on each end should be such that one interface is used for upstream traffic and the other for downstream (as best you can). Changing the etherchannel load-balance algorithm on the switch changes it for *ALL* port-channels (sadly), so you will have some limits here.

With the LACP hashing at its default (layer2 hashing), you *could* cause a situation where traffic is largely flowing across one interface. You don't want both ends with the same hashing algorithm, either, as they'll both decide to use the same link, and the transmitting end strictly controls the hashing result for that frame; the receiving end has no power to influence things. Your best bet is an algorithm involving layer-3 information (most universally compatible).

If you have incomplete LACPDU information, and the portchannel is not really intact on the switchport side, you could have the issue my customer had with reply packets being strangely dropped and never received by the firewall kernel. Theirs manifested as a weird DHCP problem at first, but they did have a performance issue with hosts that were able to get connected.

You could also have a bad cable... I've found so many bad cables or pinched cables in datacenters and server rooms. Check your switch port for any interface errors, too. Even "good" cables have been known to go bad.

Martin_Seeger · ‎2020-10-02

We checked the bond-parameters with and without tcpdump running. No change (see reply above to Timothy).

Timothy_Hall · ‎2020-10-02

Wow, that's a strange one alright. Can't really think how else tcpdump would make things better like that, but in very general terms when running tcpdump suddenly fixes something it usually means there is a problematic configuration setting somewhere. Not real helpful or specific I know, but here is an example:

sk107496: ClusterXL in High Availability mode starts passing traffic only if TCPdump is started

I suppose it may be some kind of strange interaction with Multi-Queue, try posting output of mq_mng -o -vv. Perhaps while tcpdump is running it subtly corrects some kind of imbalanced queueing between the cores like this:

https://community.checkpoint.com/t5/VSX/Multiqueue-CPU-load-excessive-on-two-cores-R80-30-T215-3-10/...

Otherwise probably going to need a TAC case for this one, please report back to this thread what is found.

Martin_Seeger · ‎2020-10-04

@Timothy_Hall Thanks again for the additional pointers.

The thing is absolutely weird.

The firewall is completely bored (the VS is using less than 10% of the assigned resources).
We only started the tcpdump to help the guys debugging it, now suddenly the firewall is a suspect.
But in total the evidence points in a different direction: when they put the client into the same network as the server, performance decreases and aborts also happen.
The client is transferring files via SMB to the server. We thought it to be SMBv2, but they transfer multiple files concurrently (so that points to SMBv3) through a single TCP-445 connection.
If they transfer single files (one after the other) the problem does not occur. So we are tinkering with a clean (without CIFS protocol) service.

Timothy_Hall · ‎2020-10-04

Your last bullet point sure sounds like the application is encountering some kind of race condition when transferring multiple files. An example would be a process executing an action, then going to sleep waiting for the result to arrive via some kind of interrupt/signal. A race condition occurs when everything is being handled so fast that the interrupt arrives before the process can even go to sleep to wait for it. Is the application running on extremely fast clients and servers with SSDs or something like that?

Libpcap simply T's or gets copies of frames at the NIC driver level for consumption by tcpdump. This doesn't usually slow down things to a great degree unless the firewall is very busy already which is not happening in your case. However the increase in latency when running a tcpdump is certainly not zero.

One other remote possibility is that packets are being delivered out of order (which really screws up performance), and running tcpdump somehow gets them back in the right order.

An interesting experiment might be to deliberately slow/limit the application's bandwidth, via a Limit action in an APCL/URLF-capable policy layer, or via the QoS blade, and see what happens. Keep in mind though that the limit is enforced by dropping packets and not delaying/queueing them, so that may just exacerbate the problem further.

HeikoAnkenbrand · ‎2020-10-04

Hi @Martin_Seeger,

Running tcpdump on an interface puts the interface in promiscuous mode - the interface will accept all traffic it receives on the network. This causes the L2 stack of a Linux system to work completely differently. As a result, packets may be forwarded that would be blocked in non-promiscuous mode.

Furthermore, the CoreXL multi core functionality is limited when promiscuous mode is enabled through tcpdump.

Running tcpdump causes a significant increase in CPU usage and, as a result, impacts the performance of the device.
Even when filtering by a specific interface or port, CPU usage remains high.

Read more here: R80.x - Performance Tuning and Debug Tips - TCPDUMP vs. CPPCAP

If that brings positive changes, that is nice. In a VSX environment I would be very careful. But I would open a TAC ticket, because the consequences are not foreseeable.

PS: I programmed promiscuous mode drivers and developed software for them in the past (long time ago). Starting from Linux Kernel 3.1 the promiscuous mode was strongly revised. Here the RX buffer handling is completely different. This can have some side effects in the layer 2 stack, because e.g. layer 2 stop packets, broadcast packets and multicast packets are processed differently. I think this may also have an impact on IEEE 802.3ad (LACP load sharing).

➜ CCSM Elite, CCME, CCTE

Timothy_Hall · ‎2020-11-27

Hi Martin,

Any updates on this issue? A puzzler for sure... Does using cppcap instead of tcpdump yield the same results?

Kaspars_Zibarts · ‎2020-11-27

Are you running the 2.6 or the 3.10 kernel? Sorry, reading on mobile and the thread is long. For us, upgrading VSX to the 3.10 kernel (we had to upgrade to R80.40 though, as we ran R80.30 on 2.6) made a massive difference in MQ performance. Just wondering if that can help you..

Martin_Seeger · ‎2020-11-30

We're still running 2.6. Looking for a time slot to make the update to 3.10. Thanks for the pointer....

Martin_Seeger · ‎2020-11-30

Update, but not a resolution yet. We implemented "fast acceleration" which improved the situation and a cluster switch improved it even more. Customer request was to "leave it that way" while in a critical phase.

"tcpdump" and "cppcap" showed the same behavior while "fw monitor" had no effect.

Timothy_Hall · ‎2020-11-30

Thanks for the update, your statement that tcpdump and cppcap have the same effect would definitely seem to implicate libpcap as the culprit here. In kernel 2.6.18 the libpcap version was 0.9.4 (circa 2005), while in Gaia 3.10 it has been updated to 1.5.3 (circa 2013). However Check Point has been known to silently patch elements of the Linux OS, so these reported version numbers may not be entirely accurate. Looking at the changelog for libpcap (https://www.tcpdump.org/libpcap-changes.txt), there were some fixes between 0.9.4 and 1.5.3 that may be relevant to your situation:

1.0.0: Better support for dealing with VLAN tagging/stripping
1.2: Fix configure-script discovery of VLAN acceleration support
- see http://netoptimizer.blogspot.com/2010/09/tcpdump-vs-vlan-tags.html
1.4.0: Fix handling of VLAN tag insertion to check, on Linux 3.x kernels, for VLAN tag valid flag
1.5.0: TPACKET_V3 support added for Linux (https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt)
1.7.3: Work around a Linux bonding driver bug (fix not included with Gaia 3.10)

It would be interesting to try the Gaia 3.10 version of libpcap if possible and see if it has the same effect on your application. This has got to be some kind of strange interaction with bonds or VLAN tagging when tcpdump/cppcap is running via the older version of libpcap.

Martin_Seeger · ‎2021-03-17

I think we cracked the nut:

The application is sensitive to reordering packets.
When packets pass the firewall, they take a variable amount of time to traverse the stack (especially with IPS enabled). This can lead to reordering.
"tcpdump" slightly slows the flow of packets. That is enough to get the packets "back in order".

Ultimately the solution was to enable "fast accel" for this application. This unifies packet transition times and avoids reordering.
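
For anyone who wants to check a capture for this kind of reordering, a rough sketch follows. It assumes you have already extracted the TCP sequence numbers of one direction of one connection, in arrival order, from a capture taken behind the firewall. The function name and the sample values are made up, and the sketch neither handles sequence-number wraparound nor tells a retransmission apart from a reordered segment.

```python
def count_out_of_order(seqs):
    """Count segments that arrive with a sequence number below the highest seen so far."""
    highest = None
    late = 0
    for seq in seqs:
        if highest is not None and seq < highest:
            late += 1  # arrived after a segment that logically follows it
        else:
            highest = seq
    return late


# Hypothetical arrival orders for 1460-byte segments.
in_order = [1, 1461, 2921, 4381, 5841]
reordered = [1, 1461, 4381, 2921, 5841]  # third and fourth segments swapped

print(count_out_of_order(in_order), count_out_of_order(reordered))
```

Comparing the count for captures taken in front of and behind the firewall, with and without fast_accel, would show whether the reordering is really introduced on the gateway.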

Security-wise not the best solution. But as the "go live" is on Saturday, there is no alternative.

Someone has to beat that software vendor with a stick (preferably a big one). I have no idea how one produces software that is sensitive to such small-scale reordering. It is a "highly optimised" video editing solution.

Duane_Toler · ‎2021-03-18

Wow! That is quite epic. Sounds reasonable, too. I've had to enable fast_accel for a customer recently on R80.30 HFA 226 for a Remote Desktop connection via VPN (both site-to-site and RemoteAccess), but we're using kernel 3.10 (versus your 2.6 kernel). Reading through the thread, looks like Timothy was on the trail earlier, after all. So perhaps this issue lies in the CPAS fw kernel module? That's medium path, which should go to CoreXL (IIRC). Very interesting situation indeed. No doubt TAC will be interested in this!

_Val_ · ‎2020-10-02

I would recommend to open a TAC case.

Martin_Seeger · ‎2020-10-02

Yes, but we want to get a structured debug first. We cannot test ourselves and the error reports come from people doing video edits. So a bit of knowledge transfer is required first.

TCPdump active: performance increases and problem disappears