Solved: Re: core affinity R80.40 - two cores

D_TK · ‎2020-06-15

Did the Blink R80.40 upgrade on a 4600 appliance cluster that was previously running R80.20. On R80.20, with CoreXL enabled, this was the distribution, and RX-DRPs were typically close to zero:

# sim affinity -l
Mgmt : 0
eth1 : 1
eth2 : 1
eth3 : 0
eth5 : 0

After the Blink upgrade, the appliances were in USFW mode, which seemed strange for a 2-core box. Working with TAC, I changed them back to kernel mode, and with CoreXL enabled the allocation doesn't change from:

# sim affinity -l
Mgmt : 0
eth1 : 0
eth2 : 0
eth3 : 0
eth5 : 0

# fw ctl affinity -l -r
CPU 0:  eth5 eth1 eth2 eth3 Mgmt
        fw_1
CPU 1:  fw_0
All:    in.asessiond mpdaemon in.acapd usrchkd pepd in.geod rad rtmd fwd lpd vpnd pdpd cpd cprid

and RX-DRPs are accumulating.

I've never had to change CoreXL from "auto" mode. Should I even attempt to balance the interfaces? I haven't heard about any user experience issues, but traffic is pretty light right now with most folks still WFH.

thanks

Timothy_Hall · ‎2020-06-16

Yep, sure enough, these are zero for both interfaces:

rx_no_buffer_count: 0
rx_missed_errors: 0

So there is no problem with ring buffer misses/overruns. This situation was covered in my book:

RX-DRP Culprit 1: Unknown or Undesired Protocol Type

In every Ethernet frame is a header field called “EtherType”. This field specifies the OSI
Layer 3 protocol that the Ethernet frame is carrying. A very common value for this
header field is 0x0800, which indicates that the frame is carrying an Internet Protocol
version 4 (IPv4) packet. Look at this excerpt from Stage 6 of “A Millisecond in the life
of a frame”:

Stage 6: At a later time the CPU begins SoftIRQ processing and looks in the ring
buffer. If a descriptor is present, the CPU retrieves the frame from the associated
receive socket buffer, clears the descriptor referencing the frame in the ring
buffer, and sends the frame to all “registered receivers” which will be the
SecureXL acceleration driver. If a tcpdump capture is currently running,
libpcap will also be a “registered receiver” in that case and get a copy of the
frame as well. The SoftIRQ processing continues until all ring buffers are
completely emptied, or various packet count or time limits have been reached.

During hardware interrupt processing, the NIC driver will examine the EtherType
field and verify there is a “registered receiver” present for the protocol specified in the
frame header. If there is not, the frame is discarded and RX-DRP is incremented.

Example: an Ethernet frame arrives with an EtherType of 0x86dd indicating the
presence of IPv6 in the Ethernet frame. If IPv6 has not been enabled on the firewall (it is
off by default), the frame will be discarded by the NIC driver and RX-DRP incremented.

What other protocols are known to cause this effect in the real world? Let’s take a look
at a brief sampling of other possible rogue EtherTypes you may see, that is by no means
complete:

AppleTalk (0x809b)
IPX (0x8137 or 0x8138)
Ethernet Flow Control (0x8808) if NIC flow control is disabled
Jumbo Frames (0x8870) if the firewall is not configured to process jumbo frames

The dropping of these protocols for which there is no “registered receiver” does
cause a very small amount of overhead on the firewall during hardware interrupt
processing, but unless the number of frames discarded in this way exceeds 0.1% of all
inbound packets, you probably shouldn’t worry too much about it. An easy way to
confirm that the lack of a registered receiver is the cause of RX-DRPs is to perform the
following test:

1. In an SSH or terminal window, run watch -d netstat -ni and confirm the constant incrementing of RX-DRP on (interface).

2. In a second SSH session, run tcpdump -ni (interface) host 1.1.1.1

Does the near constant incrementing of RX-DRP on that interface suddenly stop as
long as the tcpdump is still running, and resume when the tcpdump is stopped? If so,
the lack of a registered receiver is indeed the cause of the RX-DRPs. The specified filter
expression (host 1.1.1.1 in our example) does not actually matter, since libpcap will
register to receive all protocols on behalf of the running tcpdump, and then filter the
packets based on the provided tcpdump expression. So as long as the tcpdump is
running, there is essentially a registered receiver for everything.
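
The 0.1% rule of thumb above can be checked with a small helper. This is only a sketch: rx_drop_pct is a hypothetical function name, it assumes the usual netstat -ni header with RX-OK and RX-DRP columns, and it treats RX-OK as a stand-in for "all inbound packets", which is an approximation. The sample numbers are made up for illustration and are not from this thread.

```shell
# Print RX-DRP as a percentage of RX-OK for each interface.
# Reads `netstat -ni` output on stdin. Columns are located by header name
# because their positions differ between net-tools versions.
rx_drop_pct() {
    awk '
        $1 == "Iface" {
            for (i = 1; i <= NF; i++) {
                if ($i == "RX-OK")  ok  = i
                if ($i == "RX-DRP") drp = i
            }
            next
        }
        ok && drp && $ok > 0 {
            pct = 100 * $drp / $ok
            printf "%s %.3f%% %s\n", $1, pct, (pct > 0.1 ? "above-0.1%" : "ok")
        }
    '
}

# Illustrative input only (hypothetical counters, not from the thread).
rx_drop_pct <<'EOF'
Kernel Interface table
Iface MTU RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth1 1500 200000 0 100 0 150000 0 0 0 BMRU
eth5 1500 100000 0 2500 0 80000 0 0 0 BMRU
EOF
```

On a live gateway the same function would be fed with netstat -ni | rx_drop_pct. The counters are cumulative since boot, so a high percentage says nothing about whether the drops are still happening; the watch -d test above covers that.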

But how can we find out what these rogue protocols are, and more importantly figure
out where they are coming from? Run this tcpdump command to show every frame not
carrying IPv4 traffic or ARP traffic based on the EtherType header field:

tcpdump -c100 -eni (iface) not ether proto 0x0800 \
and not ether proto 0x0806 and not stp

(Note the ‘\’ at the end of line 1 of this command is a backslash and allows us to
continue the same command on a new line)
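
To see at a glance which sender is producing which rogue protocol, the capture can be piped through a small tally. A minimal sketch, assuming the default tcpdump -e line layout (timestamp, source MAC, ">", destination MAC, then "ethertype NAME (0xNNNN),"); ethertype_tally is a hypothetical name, and the sample lines and MAC addresses are made up for illustration.

```shell
# Count frames per (source MAC, EtherType) pair from `tcpdump -e` text output.
# Lines without an "ethertype" token (e.g. 802.3/LLC frames) are skipped.
ethertype_tally() {
    awk '
        match($0, /ethertype [^,]+/) {
            # Drop the 10-character "ethertype " prefix, keep "NAME (0xNNNN)".
            etype = substr($0, RSTART + 10, RLENGTH - 10)
            count[$2 " " etype]++
        }
        END { for (k in count) printf "%d %s\n", count[k], k }
    ' | sort -rn
}

# Illustrative input only (hypothetical frames, not captured on the 4600).
ethertype_tally <<'EOF'
10:00:00.000001 00:11:22:33:44:55 > 33:33:00:00:00:01, ethertype IPv6 (0x86dd), length 86: fe80::1 > ff02::1: ICMP6
10:00:01.000001 00:11:22:33:44:55 > 33:33:00:00:00:01, ethertype IPv6 (0x86dd), length 86: fe80::1 > ff02::1: ICMP6
10:00:02.000001 66:77:88:99:aa:bb > 01:80:c2:00:00:0e, ethertype Unknown (0x88cc), length 60
EOF
```

On the firewall, the input would come from the tcpdump command above, piped into ethertype_tally. The source MAC in the first column of each result is what points to the device sending the frames.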

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

_Val_ · ‎2020-06-15

Are you sure you only have two cores on your appliance? I think it has to be 4.

With auto-core assignment and 4 cores, only one of the cores serves as SND.

D_TK · ‎2020-06-16

Yep, the 4600 only has 2 cores.

Timothy_Hall · ‎2020-06-16

There is no automatic interface affinity in R80.40, at least not the way it was implemented in earlier releases. On Gaia 3.10, Multi-Queue is enabled for all interfaces (except management), which spreads the SoftIRQ/SecureXL load across all your SND cores (2) in your 2/2 split. It is possible that the interfaces on the 4600 (a pretty old box) are not capable of Multi-Queue, and that is why everything is on Core 0 instead.

What does the output of the expert mode command mq_mng --show reveal? If that command doesn't work, try the clish command show interface (interface name) multi-queue.

D_TK · ‎2020-06-16

Thanks Tim. No surprise, I received this message from that command: "No multiqueue supported interfaces available".

This cluster ran very clean on R80.20 - it has under 50 users behind it, and is fed by only a 50M MPLS link.

I attached a netstat -ni. The strangest part is that the interface with the enormous number of drops (eth5) has the least traffic: it's just the backup cable modem we use for ISP redundancy. The only traffic on it is ICMP and tunnel-test.

Wondering if I should cap my 2-core 4000 series boxes at R80.30.

Thanks for any feedback.

Timothy_Hall · ‎2020-06-16

Please provide the output of ethtool -S eth5 and ethtool -S eth1. Although uncommon, sometimes RX-DRPs are caused not by ring buffer misses/overflows, but by incoming frames carrying unknown protocols that have no registered receiver. This reporting behavior seems to have changed in Gaia 3.10, making it more likely than it was on Gaia 2.6.18.
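
For anyone following along, the two ring buffer counters can be pulled out of the long ethtool -S listing with a few lines of awk. This is a rough sketch: ring_buffer_check is a hypothetical name, and the counter names rx_no_buffer_count and rx_missed_errors are the ones this NIC driver reports in this thread; other drivers may name them differently.

```shell
# Extract the ring buffer miss counters from `ethtool -S` output on stdin
# and print a one-line verdict. Counter names are driver-specific.
ring_buffer_check() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }   # ethtool indents each counter line
        $1 == "rx_no_buffer_count" || $1 == "rx_missed_errors" {
            print $1 "=" $2
            total += $2
        }
        END {
            if (total > 0) print "verdict: ring buffer misses present"
            else           print "verdict: no ring buffer misses"
        }
    '
}

# Three of the eth5 counters as posted in this thread.
ring_buffer_check <<'EOF'
rx_packets: 124212
rx_no_buffer_count: 0
rx_missed_errors: 0
EOF
```

On the gateway this would be run as ethtool -S eth5 | ring_buffer_check. If both counters are zero while RX-DRP keeps climbing, the drops are coming from somewhere other than the ring buffer.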

D_TK · ‎2020-06-16

ethtool -S eth5
NIC statistics:
rx_packets: 124212
tx_packets: 81970
rx_bytes: 12548006
tx_bytes: 9474872
rx_broadcast: 28189
tx_broadcast: 1660
rx_multicast: 9074
tx_multicast: 2
rx_errors: 0
tx_errors: 0
tx_dropped: 0
multicast: 9074
collisions: 0
rx_length_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_no_buffer_count: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 0
tx_restart_queue: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 0
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_csum_offload_good: 65235
rx_csum_offload_errors: 4
rx_header_split: 0
alloc_rx_buff_failed: 0
tx_smbus: 0
rx_smbus: 0
dropped_smbus: 0
rx_dma_failed: 0
tx_dma_failed: 0
rx_hwtstamp_cleared: 0
uncorr_ecc_errors: 0
corr_ecc_errors: 0
tx_hwtstamp_timeouts: 0
tx_hwtstamp_skipped: 0

ethtool -S eth1
NIC statistics:
rx_packets: 3361399
tx_packets: 3707821
rx_bytes: 1152961595
tx_bytes: 2555983567
rx_broadcast: 195818
tx_broadcast: 26641
rx_multicast: 3054
tx_multicast: 2
rx_errors: 0
tx_errors: 0
tx_dropped: 0
multicast: 3054
collisions: 0
rx_length_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_no_buffer_count: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 0
tx_restart_queue: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 0
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_csum_offload_good: 3150510
rx_csum_offload_errors: 0
rx_header_split: 0
alloc_rx_buff_failed: 0
tx_smbus: 0
rx_smbus: 0
dropped_smbus: 0
rx_dma_failed: 0
tx_dma_failed: 0
rx_hwtstamp_cleared: 0
uncorr_ecc_errors: 0
corr_ecc_errors: 0
tx_hwtstamp_timeouts: 0
tx_hwtstamp_skipped: 0

Thank You.

Timothy_Hall · ‎2020-06-16

Yep, sure enough, these are zero for both interfaces:

rx_no_buffer_count: 0
rx_missed_errors: 0

So there is no problem with ring buffer misses/overruns. This situation was covered in my book:

RX-DRP Culprit 1: Unknown or Undesired Protocol Type

In every Ethernet frame is a header field called “EtherType”. This field specifies the OSI
Layer 3 protocol that the Ethernet frame is carrying. A very common value for this
header field is 0x0800, which indicates that the frame is carrying an Internet Protocol
version 4 (IPv4) packet. Look at this excerpt from Stage 6 of “A Millisecond in the life
of a frame”:

Stage 6: At a later time the CPU begins SoftIRQ processing and looks in the ring
buffer. If a descriptor is present, the CPU retrieves the frame from the associated
receive socket buffer, clears the descriptor referencing the frame in the ring
buffer, and sends the frame to all “registered receivers” which will be the
SecureXL acceleration driver. If a tcpdump capture is currently running,
libpcap will also be a “registered receiver” in that case and get a copy of the
frame as well. The SoftIRQ processing continues until all ring buffers are
completely emptied, or various packet count or time limits have been reached.

During hardware interrupt processing, the NIC driver will examine the EtherType
field and verify there is a “registered receiver” present for the protocol specified in the
frame header. If there is not, the frame is discarded and RX-DRP is incremented.

Example: an Ethernet frame arrives with an EtherType of 0x86dd indicating the
presence of IPv6 in the Ethernet frame. If IPv6 has not been enabled on the firewall (it is
off by default), the frame will be discarded by the NIC driver and RX-DRP incremented.

What other protocols are known to cause this effect in the real world? Let’s take a look
at a brief sampling of other possible rogue EtherTypes you may see, that is by no means
complete:

AppleTalk (0x809b)
IPX (0x8137 or 0x8138)
Ethernet Flow Control (0x8808) if NIC flow control is disabled
Jumbo Frames (0x8870) if the firewall is not configured to process jumbo frames