kkiat98
Explorer

Unable to find reason for RX_DRP

Hello guys,

Our firewall is a CP6600 with 6 cores. Recently we have been observing high RX_DRP counts on one 10G interface when traffic across the interface reaches 5-6 Gbps. There is no RX_DRP on the other interfaces.

[Image: solarwinds_drop.png]

What we have observed:

1. From cpview -t, CPU load can reach 80+% at peak traffic (packet count of ~800K), which is when the RX_DRP counter increments fastest. We also see RX_DRP increment when the CPU is at 40-60% load, but at a lower rate.

2. RX_OVR is 0, which I believe indicates that we are not short of ring buffer on the interface. It is set to the maximum of 4096 anyway.

3. All error counters are 0 under ethtool -S.

> ethtool -S eth1-01 | grep -i error
rx_errors: 0
tx_errors: 0
rx_length_errors: 0
rx_crc_errors: 0
veb.tx_errors: 0
port.tx_errors: 0
port.rx_crc_errors: 0
port.rx_length_errors: 0

<snip>

4. We suspected the software (softnet) backlog, but the 2nd column of softnet_stat, which counts backlog drops, is 0 (see the watch loop sketched after the output below).

> cat /proc/net/softnet_stat
788b4e9b 00000000 00000460 00000000 00000000 00000000 00000000 00000000 00000000 00000000
d5685693 00000000 000007d9 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0267b0ff 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
02661ccc 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
026d8472 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
02583552 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
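To correlate these counters in real time, here is a minimal watch loop (a sketch assuming the interface is eth1-01 and standard sysfs paths; rx_dropped is the same counter reported as RX_DRP):

# Sample RX_DRP and the per-CPU softnet backlog-drop column every 5 seconds
IF=eth1-01
while :; do
    drp=$(cat /sys/class/net/$IF/statistics/rx_dropped)
    # 2nd column of softnet_stat (hex) = packets dropped from each CPU's backlog
    soft=$(awk '{printf "%s ", $2}' /proc/net/softnet_stat)
    echo "$(date +%T)  rx_dropped=$drp  softnet_drops_per_cpu=[$soft]"
    sleep 5
done

If rx_dropped climbs while the softnet columns stay zero, the drops are happening before the backlog, i.e. at the NIC/ring-buffer or protocol-validation stage.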

 

Can someone suggest other possible causes for this issue and how to check for them? So far Check Point support hasn't been able to pin down the issue.

Thanks in advance.

emmap
MVP Gold CHKP

Have you seen this article?

https://support.checkpoint.com/results/sk/sk166424

It may be that such drops are normal 'background noise'.

Timothy_Hall
MVP Gold

Please post the full output of ethtool -S for the interface in question; be on the lookout for nonzero counters containing "fifo" or "missed", which indicate legitimate ring buffer misses resulting in dropped frames. But it is more likely "junk" traffic incrementing the RX-DRP counter, such as invalid EtherTypes or improperly pruned VLAN tags.
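A quick way to scan for those counters (a sketch, with eth1-01 assumed as the interface in question; exact counter names vary by driver):

# Flag any ring-buffer-related counters that are incrementing
ethtool -S eth1-01 | grep -Ei 'fifo|miss|drop'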

kkiat98
Explorer

Hi @Timothy_Hall and @emmap, I cannot find "fifo" or "missed" in my output.

Referring to the article shared, I did a 30s pcap at peak traffic to check the possible causes of drops that it suggests.

  • The softnet backlog is full >> cat /proc/net/softnet_stat 2nd column is 0
  • Ethernet frames are received with bad VLAN tags >> It is an access port
  • Packets are received with unknown or unregistered protocols >> EtherType seen in pcap is 0x8100 only (see the tally sketch after this list)
  • IPv6 packets are received while IPv6 is disabled in Gaia >> Only IPv4 seen in PCAP, plus, our application does not use IPv6
  • MTU size mismatch, CRC errors, speed or duplex mismatch >> No errors seen in ethtool
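For reference, one way to tally the EtherTypes present in a capture (a sketch assuming tshark is available, e.g. on an analysis workstation; capture.pcap is a placeholder filename):

# Count distinct EtherTypes in the capture, most frequent first
tshark -r capture.pcap -T fields -e eth.type | sort | uniq -c | sort -rn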

Timothy_Hall
MVP Gold

What tool did you use for the pcap?  fw monitor will not show you packets that are RX-DRPed.  tcpdump will show you those due to promiscuous mode, but I'm not sure about cppcap.

Please characterize how the RX-DRP counter is being incremented; sar -n EDEV can help with this for the last 30 days. Is it climbing steadily over time? If so, the traffic is probably garbage. If the counter is going up constantly, does it stop for as long as you are running a tcpdump on that interface (the filter does not matter, but specify a good one anyway to keep from bogging down the firewall), and then start incrementing again after you stop the tcpdump?

If the counter is occasionally going up in big clumps (sar -n EDEV can help), those may be legitimate ring buffer misses.
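A minimal version of that test (a sketch assuming eth1-01; the filter address is a placeholder):

# Terminal 1: watch the drop counter (the same counter reported as RX-DRP)
watch -n 2 'cat /sys/class/net/eth1-01/statistics/rx_dropped'

# Terminal 2: tcpdump puts the NIC in promiscuous mode; the filter just
# limits load on the firewall (192.0.2.1 is a placeholder address)
tcpdump -ni eth1-01 'host 192.0.2.1' > /dev/null

If the counter freezes while tcpdump is running and resumes when it stops, promiscuous mode is accepting frames the NIC would otherwise discard as junk.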

kkiat98
Explorer

I am using an external tool to take the pcap for traffic to/from that interface.

Regarding your other question, this is the sar output I got the other day. It corresponds to our daily traffic pattern. So could it be a bug that I am not seeing RX-OVR despite it being a ring buffer miss? From my understanding, ring buffer drops usually come with high RX-OVR counts.

[Expert@tkdextf1:0]# sar -n EDEV | grep eth1-01
00:10:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
00:20:02 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
00:30:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
00:40:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
00:50:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
01:00:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
01:10:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
01:20:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
01:30:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
01:40:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
01:50:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:00:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:10:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:20:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:30:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:40:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
02:50:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
03:00:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
03:10:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
03:20:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
03:30:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
03:40:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
03:50:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
04:00:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
04:10:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
04:20:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
04:30:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
04:40:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
04:50:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:00:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:10:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:20:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:30:02 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:40:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:50:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:00:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:10:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:20:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:30:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:40:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:50:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
07:00:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
07:10:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
07:20:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
07:30:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
07:40:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
07:50:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
08:00:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
08:10:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
08:20:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
08:30:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
08:40:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
08:50:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
09:00:01 eth1-01 0.00 0.00 0.00 1.44 0.00 0.00 0.00 0.00 0.00
09:10:01 eth1-01 0.00 0.00 0.00 14.42 0.00 0.00 0.00 0.00 0.00
09:20:02 eth1-01 0.00 0.00 0.00 10.71 0.00 0.00 0.00 0.00 0.00
09:30:01 eth1-01 0.00 0.00 0.00 11.42 0.00 0.00 0.00 0.00 0.00
09:40:01 eth1-01 0.00 0.00 0.00 10.73 0.00 0.00 0.00 0.00 0.00
09:50:01 eth1-01 0.00 0.00 0.00 9.88 0.00 0.00 0.00 0.00 0.00
10:00:01 eth1-01 0.00 0.00 0.00 6.45 0.00 0.00 0.00 0.00 0.00
10:10:01 eth1-01 0.00 0.00 0.00 7.28 0.00 0.00 0.00 0.00 0.00
10:20:01 eth1-01 0.00 0.00 0.00 6.33 0.00 0.00 0.00 0.00 0.00
10:30:01 eth1-01 0.00 0.00 0.00 9.49 0.00 0.00 0.00 0.00 0.00
10:40:01 eth1-01 0.00 0.00 0.00 20.69 0.00 0.00 0.00 0.00 0.00
10:50:02 eth1-01 0.00 0.00 0.00 8.62 0.00 0.00 0.00 0.00 0.00
11:00:01 eth1-01 0.00 0.00 0.00 5.34 0.00 0.00 0.00 0.00 0.00
11:10:01 eth1-01 0.00 0.00 0.00 1.38 0.00 0.00 0.00 0.00 0.00
11:20:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:30:01 eth1-01 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00
11:40:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:50:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:00:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:10:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:20:01 eth1-01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
12:30:01 eth1-01 0.00 0.00 0.00 0.05 0.00 0.00 0.00 0.00 0.00
Average: eth1-01 0.00 0.00 0.00 1.66 0.00 0.00 0.00 0.00 0.00
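(For reference: the nonzero column above is rxdrop/s; after the interface name, sar -n EDEV prints rxerr/s, txerr/s, coll/s, rxdrop/s, and so on. A one-liner to pull out just the intervals with drops, assuming the 24-hour timestamps shown above:)

sar -n EDEV | awk '$2 == "eth1-01" && $6 > 0 {print $1, $6}'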

kkiat98
Explorer

Hi, it was an external tool that took the pcap.

Regarding the drop pattern, it corresponds to our daily traffic. I've attached today's sar output for reference.

Aren't ring buffer misses usually accompanied by increments in RX-OVR, or is that not always the case? One of the suggestions from Check Point is to try bonding two interfaces. We will schedule a change for that.

Timothy_Hall
MVP Gold (Accepted Solution)

Based on that RX-DRP pattern in the sar output, it does indeed appear to be load-based RX-DRPs and not just constant garbage traffic.

Whether RX-OVR and RX-DRP increment in "lockstep" depends on the driver and NIC hardware.  When the NIC attempts to DMA a frame from its hardware buffer into the Gaia ring buffer in RAM and the ring buffer is full, some NICs simply lose the frame, in what I call a "buffering miss," and only RX-DRP is incremented.

However, when more advanced NICs encounter this situation, they pull the frame back without losing it and try again later.  But if the ring buffer is constantly full, frames eventually start piling up in the hardware buffer on the NIC itself.  If that buffer overflows at the NIC hardware level, both RX-OVR and RX-DRP are incremented in lockstep to indicate that the overflow was indirectly caused by a full ring buffer, not by the NIC itself being unable to keep up with the incoming frame rate, which is the classic cause of an RX-OVR.
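Two quick checks that help tell these cases apart (a sketch, with eth1-01 assumed; per-queue counter names vary by driver):

# Current vs. maximum hardware ring size
ethtool -g eth1-01

# Per-queue receive counters; one queue far busier than the rest points
# at hash imbalance rather than overall ring capacity
ethtool -S eth1-01 | grep -E 'rx[-_]queue|rx-[0-9]+\.'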

A bond may help by adding another interface and ring buffer, but Dynamic Split should take care of this unless you are bumping against queue limitations for some NIC driver hardware, or there is not enough overall spare computing capacity to add more SND cores. 

Do you have Dynamic Split enabled?  (show dynamic-balancing state or the expert mode command dynamic_balancing -p).  

Next step is to ensure Multi-Queue has not been messed with for that interface; please post the output of mq_mng -o -v.

Another thing to check is how well Multi-Queue is balancing the traffic across the SNDs by paying a visit to the Advanced...SecureXL...Network-per-CPU screen in cpview.

Any chance you have a lot of incoming VPN traffic on this interface?  There is a known problem with some NIC drivers that causes severe SND imbalances that is solved by UPPAK: sk183525: High CPU usage on one SND core.

kkiat98
Explorer

No, we do not have VPN across this interface. 

😞 dynamic-balancing is not enabled. But we don't have the luxury of allocating more cores to SND without risking further unexpected impacts, so I have skipped that suggestion from Check Point. And since eth1-01 is already using all the available queues and cpview's network-per-cpu view looks quite balanced, I think our firewall is simply too weak for our current use case.
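For reference, the current core split and Dynamic Split status can be confirmed from expert mode (output varies by Gaia version):

# Which cores are assigned to SND/interface queues vs. firewall workers
fw ctl affinity -l -r

# Dynamic Split (Dynamic Balancing) status
dynamic_balancing -p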

Thank you all for your assistance.

 

Chris_Atkinson
MVP Platinum CHKP

Is this interface part of a bond and what hashing is used on either side?
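(For reference, if the interface were bonded, the transmit hash policy on the Gaia side could be checked like this; bond1 is a placeholder name:)

grep -i 'hash' /proc/net/bonding/bond1
cat /sys/class/net/bond1/bonding/xmit_hash_policy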

kkiat98
Explorer

Nope, it is a physical interface. One of Check Point's suggestions is to bond it and see if that helps with the drops. We might give that a try.

