Jason_Carrillo
Collaborator

RX-DRP/rx_missed_errors on Interface

We have a cluster of Open Hardware R80.10 systems that show a fair number of RX-DRPs when we run netstat -ni. Those numbers coincide with the rx_missed_errors counter reported by ethtool -S. I've attached the netstat -ni and ethtool -S output for that gateway.
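
For reference, here is how we are correlating the two counters (eth0 shown as an example; substitute the affected interface):

[Expert@FW1:0]# netstat -ni                              # per-interface RX-DRP column
[Expert@FW1:0]# ethtool -S eth0 | grep rx_missed_errors  # driver counter that climbs in step with RX-DRP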

Output of ethtool -i:

[Expert@FW1:0]# ethtool -i eth0
driver: ixgbe
version: 3.9.15-NAPI
firmware-version: 0x61bd0001
bus-info: 0000:04:00.0

I've got 3 other similarly outfitted clusters that don't see errors on the same scale, but they are nowhere near as busy as this main gateway.

The volume of these errors is pretty small (0.0087%), and as far as we can tell they aren't causing any issues, but their presence prevents us from running Optimized Drops on this firewall: when we turn Optimized Drops on, SecureXL crashes once these errors accrue, and CPU usage climbs across all cores.
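
(For anyone wanting to check the state of the feature on their own gateway: Optimized Drops corresponds to the Drop Templates line in fwaccel stat. The output below is illustrative, not from this box.)

[Expert@FW1:0]# fwaccel stat
Accelerator Status : on
Accept Templates   : enabled
Drop Templates     : disabled
NAT Templates      : disabled (default)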

We rely on acceleration quite a bit to keep traffic moving:

[Expert@FW1:0]# fwaccel stats -s
Accelerated conns/Total conns : 266989/361053 (73%)
Accelerated pkts/Total pkts : 1334142802/1410322803 (94%)
F2Fed pkts/Total pkts : 41668137/1410322803 (2%)
PXL pkts/Total pkts : 34511864/1410322803 (2%)
QXL pkts/Total pkts : 0/1410322803 (0%)
[Expert@FW1:0]#

Any input is appreciated; I'm just trying to figure out whether there is an easy fix for this or something more sinister going on underneath.

6 Replies
HeikoAnkenbrand
Champion

RX error counters are incremented for frames that arrive at the NIC corrupted in some way. Common causes:

Duplex mismatch between the two ends of the link.
Faulty NIC, cable, or other physical media issue.
CRC failures.
In addition, a NIC speed / duplex mismatch with the connecting port on the switch/router can be the cause (see the quick check below).
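
A quick way to rule out the speed / duplex angle (eth0 as an example):

[Expert@FW1:0]# ethtool eth0 | grep -E 'Speed|Duplex|Link detected'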

RX drops in excess of 0.1% - 0.5% of total received packets (it is recommended to err toward the smaller value) are indicative of a real issue. Anything below those values is likely just random errors with a trivial effect on performance.
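
If you want to compute the ratio on the box itself, something like this works (field positions assume the classic net-tools netstat -ni layout where RX-OK is the 4th and RX-DRP the 6th column; verify against the header on your version first):

[Expert@FW1:0]# netstat -ni | awk '$1 == "eth0" { printf "RX-DRP ratio: %.4f%%\n", $6 / $4 * 100 }'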
➜ CCSM Elite, CCME, CCTE
HeikoAnkenbrand
Champion

I think 0.0087% should not be an issue.

➜ CCSM Elite, CCME, CCTE
HeikoAnkenbrand
Champion

Or see this SK article:

Excessive RX Errors / RX Drops found on interface

➜ CCSM Elite, CCME, CCTE
christian_konne
Participant

I also think 0.0087% is not a problem.

Timothy_Hall
Champion

That RX-DRP percentage is quite low and really nothing to worry about. I highly doubt the RX-DRPs are interfering with the SecureXL Optimized Drop function, as they are two completely different things. In general, though, I'm not a fan of enabling Optimized Drops unless you actually need them, since the feature can lead to SecureXL complications like the ones you are experiencing.

Since you have a high percentage of fully-accelerated traffic (SXL path), you almost certainly need to reduce the number of CoreXL kernel instances via cpconfig, which frees up more SND/IRQ cores to handle all of the fully accelerated traffic. An rx_missed_error means no ring buffer slot was available for an incoming frame, which is mainly caused by the SND/IRQ cores not having enough CPU to empty the ring in a timely fashion. Please provide the output of the following commands and I can recommend how many additional SND/IRQ cores you should allocate:

fwaccel stat

fw ctl affinity -l -r

free -m

grep -c ^processor /proc/cpuinfo

sim affinity -l

enabled_blades
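
In the meantime, you can also check whether the RX ring buffer is already at its preset maximum; the ixgbe driver generally allows up to 4096 RX descriptors. Bear in mind that enlarging the ring is a band-aid that adds buffering latency, and rebalancing cores is the real fix. A sketch, assuming eth0 is the affected interface (the change does not survive a reboot unless made permanent):

[Expert@FW1:0]# ethtool -g eth0          # compare 'Pre-set maximums' against 'Current hardware settings'
[Expert@FW1:0]# ethtool -G eth0 rx 4096  # raise the RX ring to the ixgbe maximum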

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Jason_Carrillo
Collaborator

"I highly doubt that RX-DRPs are interfering with the SecureXL Optimized Drop function as they are two completely different things,"

I'll see if I can find the ticket, but that is what TAC told me. Turning off Optimized Drops on this particular cluster fixed the issue. I get what you are saying though.

I attached the output requested and I think you might be on to something because I am only allocating two cores to the SNDs on this cluster.
