CSSBE_Avenger
Participant

Unexpected fwFullyUtilizedDrops with low CPU – how to identify the source?

We have a 2‑node Check Point cluster and are monitoring the SNMP OID fwFullyUtilizedDrops.
Outside business hours we see ~1000 drops/sec, even though overall CPU usage is low and no process shows high CPU. During business hours, CPU becomes heavily loaded and the drops increase to 5000+/sec.

So far we have checked:

  • No physical interface drops
  • No drops visible with zdebug (sk61143)
  • No drops reported by fwaccel stats

Where else can these fwFullyUtilizedDrops originate?
Is there any way to capture or trace them (tcpdump‑style, zdebug, or other) to determine where the drops are occurring?

9 Replies
CSSBE_Avenger
Participant

I've run the seven tools and posted the output in a file.
If you're able to take a look and give me feedback on what you see, it would be much appreciated.


PhoneBoy
Admin

While the majority of packets are accelerated, the vast majority of connections are NOT.
The output notes several rules that are disabling templates.
They should be reviewed, as this can contribute to intermittent spikes in CPU utilization.

Timothy_Hall
MVP Gold

It took me a while to go through your Super Seven outputs, and you have several things going on:

1) What code level is this and on what hardware?  It looks like Hyperflow is enabled, so I would assume at least R81.20?

2) You are in KPPAK mode. Did you explicitly disable UPPAK?  If these gateways were upgraded to R81.20 or R82, the KPPAK mode will remain, whereas a fresh install will enable UPPAK by default, assuming it is a Lightspeed or Quantum Force appliance.

3) My understanding of what "instance is fully utilized" means is that traffic was trying to be sent from an SND to a Firewall Worker for Medium/Slowpath handling, but the incoming CoreXL queue was full, and the traffic was lost.  (sk61143: Traffic is dropped by CoreXL with "fwmultik_inbound_packet_from_dispatcher Reason: Instance...).  This is considered a drop by the SND, and will not be displayed by fw ctl zdebug drop, you must use fw ctl zdebug + drop to see these.  The SK discusses increasing the CoreXL queue size; I do not recommend doing so until a better understanding of the situation has been achieved.

4) There is a high level of Hyperflow boosting going on, which would suggest the presence of many elephant flows.  fw ctl multik print_heavy_conns will show all current elephant flows, as well as all those detected within the past 24 hours.  For elephant flows longer ago, check the log files of the Spike Detective:  sk166454: CPU Spike Detective

5) How are you using the QoS blade?  You have a very large amount of traffic on the QoS paths.  Limits?  Guarantees?  Shaping?  Using this capability may deliberately delay traffic in queues, resulting in full queue drops.  QoS is a very old feature, and Hyperflow is very new; I've never heard of anyone using them together, and how they might interact under high load could be quite unpredictable.  

6) Interface eth4-04 appears to be overloaded and is losing around 1% of its incoming frames; interface eth2-01 is also losing 0.21% which are both above the 0.1% rule of thumb.  This usually means there are not enough SND cores, but you seem to have plenty (20) which may indicate Multi-Queue issues or bumping up against queue limitations for certain interfaces/drivers.  But not all RX-DRPs are legitimate ring buffer misses resulting in frame loss, so please post the output of the following expert mode commands for further analysis:

ethtool -i eth4-04

ethtool -S eth4-04

ethtool -i eth2-01

ethtool -S eth2-01

mq_mng -o -v

7) Would be nice to fix your templating issues shown by fwaccel stat (you're still getting a 4% template cache hit rate even with these issues), as excessive rule base lookups due to accept template cache misses can drive up the CPU on the workers and keep them from emptying their incoming queues fast enough.  Your level of F2F/slowpath traffic is fine.

I'll need to see some more data, but at first glance my gut feel is that disabling either QoS or Hyperflow may help.  One of these is probably trying to slow things down (shape) and the other is trying to make things go much faster.  I'm very unsure about how these two are going to affect each other.
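To tie points 3 and 4 together, here is a minimal command recipe for catching these SND-side drops and the elephant flows feeding them. The grep pattern is an assumption based on the drop-reason text quoted from sk61143; adjust it to whatever string your version actually emits:

```shell
# Point 3: plain "fw ctl zdebug drop" will NOT show these SND-side drops;
# the "+" form includes all drop modules. The filter string below is an
# assumption based on the sk61143 message ("...Instance is fully utilized").
fw ctl zdebug + drop | grep -i "fully utilized"

# Point 4: list current elephant flows, plus those detected in the
# last 24 hours.
fw ctl multik print_heavy_conns
```

Run the zdebug capture briefly during a busy period only; it is a debug facility, not something to leave running.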

New Book: "Max Power 2026" Coming Soon
Check Point Firewall Performance Optimization
CSSBE_Avenger
Participant

Thank you both for your answers.

This morning, we disabled QoS, and the CPU load stabilized after this modification.
However, based on all our research and the additional information you provided, we need to take our optimization efforts to the next level.

To answer your question properly:
We have two 26000 appliances in a ClusterXL configuration running R81.20 Take 120.


Honestly, I had never heard of KPPAK and UPPAK before, so I will read up on them.
We initially deployed our environment on R81.10 three years ago, so based on what you said, we should be using KPPAK.

The QoS blade was activated last summer to guarantee bandwidth for our telephony environment, which recently transitioned to a full SIP trunk, ensuring a minimum level of service.

I have attached the output of the five commands you suggested running.

Finally, I will work on our policy to improve the conditions for the acceleration template to function more effectively.

 

Thank you again for your help.

 

P.S. We bought your book last week.

Timothy_Hall
MVP Gold

A 26000 model is not a Quantum Force, so it can't use UPPAK until you upgrade to R82.10 where it is mandatory for all models.  Ironically, by using poll mode instead of interrupt mode on the SNDs to empty ring buffers, UPPAK will solve the exact problem you are having.

Multi-Queue appears to be doing a good job of balancing traffic across the SNDs for eth4-04 and eth2-01.  However because these interfaces are using the i40e driver (which supports up to 64 total queues), and you have a total of 5 active interfaces using the i40e driver (eth2-01, eth3-01, eth3-04, eth4-01, eth4-04) each interface can only have 12 total queues, which means only 12 SNDs can empty each interface's ring buffer, no matter how many SNDs there are.  This is probably why eth4-04 is falling behind with the 1% loss via RX-DRP. I would definitely recommend an 802.3ad LACP Active/Active bond here if possible, with the transmit hash policy set to L3+4 on both sides.  This bond will create a grand total of 20 queues, allowing up to 20 SNDs to help with this heavy load and keep it from falling behind, as well as increased bandwidth for heavy bursts of traffic.
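The queue arithmetic above can be sketched as follows. The numbers (64-queue i40e driver limit, five active i40e interfaces, 20 SND cores) come from this thread; the even-split allocation model is a simplification, not the driver's exact algorithm:

```python
# Illustrative arithmetic for the i40e queue limits described above.
I40E_TOTAL_QUEUES = 64   # total RX queues the i40e driver supports
ACTIVE_I40E_IFACES = 5   # eth2-01, eth3-01, eth3-04, eth4-01, eth4-04
SND_CORES = 20           # SND/dispatcher cores on this gateway

# Queues available per interface when the driver total is split evenly:
queues_per_iface = I40E_TOTAL_QUEUES // ACTIVE_I40E_IFACES   # -> 12

# A 2-member LACP bond lets both members' queues carry the same traffic,
# but no more queues can be emptied than there are SND cores:
bond_queues = min(2 * queues_per_iface, SND_CORES)           # -> 20

print(queues_per_iface, bond_queues)
```

This is why a single i40e interface here tops out at 12 SNDs emptying its ring buffer, while the proposed bond raises the effective ceiling to all 20 SNDs.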

I'm working on a new book right now that will be fully updated and cover situations like this.  For the templating issues, here is the content from the new book documenting the current conditions that cause accept templating to stop:

SecureXL Session Rate Acceleration (Accept Templates)


A rulebase lookup is one of the most computationally expensive operations that a firewall has to perform, although the advent of Column-based Matching described in the last section has reduced the overhead substantially. SecureXL's ability to form Accept Templates (essentially a "cached" rulebase lookup) for repeated, substantially similar connections predates Column-based Matching by many years, and it no longer provides the same level of performance boost it once did.

To check the state of SecureXL Accept Templating, run fwaccel stat. Once an Accept template is created, substantially similar connections for the next 3 minutes can "hit" on the cached entry. Each hit resets the three-minute timer. As long as you see "Accept Templates: enabled" you are good to go, and the entire Firewall/Network policy layer is eligible to have Accept templates created for it (usually...unless the indicated templating match rate always seems to be zero...more on that bit later!):

temp1.png
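The three-minute hit timer described above can be sketched with a toy model. This is a hypothetical simulation for illustration only, not Check Point code; the 180-second value is the three-minute timer from the text:

```python
# Toy model of the Accept Template idle timer: each hit on a cached
# template resets a 180-second expiry clock. Purely illustrative.
TEMPLATE_TTL = 180  # seconds (the three-minute timer described above)

class AcceptTemplate:
    def __init__(self, created_at: float):
        self.expires_at = created_at + TEMPLATE_TTL

    def hit(self, now: float) -> bool:
        """Return True if the template is still live; a hit resets the timer."""
        if now >= self.expires_at:
            return False          # template already aged out
        self.expires_at = now + TEMPLATE_TTL
        return True

t = AcceptTemplate(created_at=0.0)
assert t.hit(100.0)        # live; timer reset, now expires at 280
assert t.hit(250.0)        # still live; now expires at 430
assert not t.hit(500.0)    # 500 >= 430, so the template expired
```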

 
 

But what if you see something like this:

temp2.png

 
This output indicates that a condition exists in rule 5 of the rule base, preventing the formation of Accept Templates for rule #5 and all subsequent rules. This situation DOES NOT affect which processing path (i.e., throughput acceleration) traffic will be handled in, only the "caching" of rule base lookups and formation of templates for rule 5 and beyond.  In this example, traffic matching rule #10 could possibly still be offloaded to the Medium Path or fastpath (either the software/sim or Lightspeed/hardware fastpath).  Note that, in addition to the SecureXL Accept templates covered earlier, there are also SecureXL NAT templates, which follow the state of Accept templates in terms of rulebase optimizations.  NAT Templates will be covered later.
 
The list of rulebase situations that could cause this to occur used to be very long (especially in R77.30 and earlier), but has become significantly shorter over the years.  
 
Rulebase Conditions That Can Still Disable Accept Templating in R80.10+

• Use of a DCE/RPC object in a rule (the most common cause)
• Use of legacy Global DHCP services in a rule – see sk162544: SecureXL Templates Disabled by 'gdhcp' Related Services
• Use of a service in a rule that calls for a specific SOURCE port number or range to be matched in addition to a destination port (not common), or has the checkbox "Enable reply from any port" set
• Use of a service of type "Other" (two-way arrow icon) that invokes raw INSPECT code routines in the Advanced service properties (with the exception of service traceroute & local DHCP services that have a special workaround). An example:
 
temp3.png
 
• Enabling IPS Protections Small PMTU, Network Quota, or ISN Spoofing will kill all templating for the whole rule base
• Rules utilizing legacy Resource objects or Authentication actions (User Auth/Client Auth/Session Auth)

If one of these conditions is present in a particular rule, see if the rule can be revised or removed. If it can’t, try to move the offending rule as far down the Firewall/Network policy as possible. The current percentage of connections that have matched a SecureXL connection template can be checked on the first line of output for fwaccel stats -s:
 
temp4.png
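As a rough illustration, that first-line percentage could be pulled out programmatically. The sample line below is an assumed format based on typical fwaccel stats -s output and may differ between versions:

```python
import re

# Hypothetical first line of `fwaccel stats -s` output; the exact
# wording and spacing may differ between Check Point versions.
sample = "Accelerated conns/Total conns : 14040/351000 (4%)"

m = re.search(r"(\d+)/(\d+)\s+\((\d+)%\)", sample)
accel, total, pct = (int(g) for g in m.groups())
print(f"{accel} of {total} connections hit a template ({pct}%)")
```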
 
New Book: "Max Power 2026" Coming Soon
Check Point Firewall Performance Optimization
CSSBE_Avenger
Participant

Thank you for your suggestions.

We would like to provide a quick update on our side. As we mentioned earlier, disabling QoS immediately brought the CPU back to a normal level, which confirms that QoS was contributing to the high CPU usage. However, the fwFullyUtilizedDrops counter only decreased by about half, and we still consider the remaining value abnormally high. This point remains the main concern of our original question.

In addition, we are still seeing an unusually high number of “first packet isn’t SYN” events (about 80 drops/sec). This is a long‑standing issue in our environment, and it could be related to the behavior we are observing with the drops.

At the moment, fwFullyUtilizedDrops are around 2000 drops/sec, while the drops reported by fw ctl zdebug are approximately 500 drops/sec. Regardless of whether we run fw ctl zdebug drop or fw ctl zdebug + drop, we do not see the fwFullyUtilizedDrops reflected in the zdebug output.

We will proceed with implementing all the recommendations you suggested (or as much as possible) and will report back in this forum once everything is in place and we have results to share.

Thank you again for your help and time.

Best regards.

CSSBE_Avenger
Participant

Hello,

We have corrected the accept templates configuration. It is now set to Enabled:

2026-03-20_09h58_12.png

Result: the number of fwFullyUtilizedDrops did not decrease. We are still observing approximately 2000–2500 drops/sec during peak production hours.

Before moving forward with interface bonding (802.3ad LACP Active/Active), we wanted to ensure that there was a real correlation between the number of NIC queues and the fwFullyUtilizedDrops counter.
To validate this, we deliberately reduced the number of queues by half on our two highest packet-rate interfaces.

Before the configuration change, we had the following setup:

2026-03-20_09h54_02.png

After the change, the configuration was:

2026-03-20_09h56_38.png

Result: the number of fwFullyUtilizedDrops did not increase, contrary to what we would have expected if the number of queues were directly correlated with this counter.

Based on this outcome, we currently conclude that implementing bonding (802.3ad LACP Active/Active) to increase the number of queues should not reduce the observed drops — unless we are missing an important aspect of how this counter works internally.

At this time, the only action that has resulted in a noticeable reduction of fwFullyUtilizedDrops was disabling the QoS blade.

For additional context:

  • cpview (Overview section) reports approximately 350,000 packets/sec
  • fwFullyUtilizedDrops is around 2500/sec, which represents close to 1% packet loss
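For reference, the loss figure quoted above works out as follows (simple arithmetic on the numbers reported in this post):

```python
# Numbers reported above: ~350,000 packets/sec total,
# ~2,500 fwFullyUtilizedDrops/sec.
pps = 350_000
drops = 2_500

loss_pct = drops / pps * 100
print(f"{loss_pct:.2f}% loss")   # roughly 0.71%, i.e. on the order of 1%
```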

This leads us to question our understanding of the fwFullyUtilizedDrops counter itself.
Are these actual packet drops, even though there is no clear way to identify where in the datapath the drops are occurring?

We are planning to upgrade to R82 in the coming weeks, and we will observe whether this improves the situation.

Thank you for your time and for any additional insight you may be able to provide.

Best regards,

CSSBE_Avenger
Participant

After extensive troubleshooting, we finally identified the root cause — and it turned out to be a monitoring configuration error on our end.

We had assigned OID iso.3.6.1.4.1.2620.1.1.25.13.0 (fwLoggedTotal) to our Zabbix item instead of iso.3.6.1.4.1.2620.1.1.25.26.0 (fwFullyUtilizedDrops). We were never actually observing fwFullyUtilizedDrops at all — we were monitoring the total number of logged connections.
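A simple sanity check like the following would have caught the mixup early. The OID-to-name map mirrors the two OIDs quoted above (iso = 1 in numeric form); the check function itself is hypothetical, not part of Zabbix or Check Point tooling:

```python
# Map of the two Check Point SNMP OIDs that were confused in this thread.
CHKPNT_OIDS = {
    "1.3.6.1.4.1.2620.1.1.25.13.0": "fwLoggedTotal",
    "1.3.6.1.4.1.2620.1.1.25.26.0": "fwFullyUtilizedDrops",
}

def check_item(item_name: str, configured_oid: str) -> bool:
    """Return True if the monitoring item polls the OID it claims to."""
    return CHKPNT_OIDS.get(configured_oid) == item_name

# The misconfiguration described above: item named fwFullyUtilizedDrops
# but actually polling the fwLoggedTotal OID.
assert not check_item("fwFullyUtilizedDrops", "1.3.6.1.4.1.2620.1.1.25.13.0")
# The intended configuration:
assert check_item("fwFullyUtilizedDrops", "1.3.6.1.4.1.2620.1.1.25.26.0")
```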

In hindsight, the most obvious clue was right there from the beginning: the drops were completely invisible in zdebug drop and in SmartConsole logs. That inconsistency should have immediately pointed to a measurement problem rather than an actual performance issue.

We want to thank Timothy Hall and Bob Zimmerman for their thorough and technically sound analysis of our CoreXL/SMT configuration. Even though it was based on a false premise, the findings regarding SND/Worker core overlap on our dual-socket setup are real and worth addressing independently.

Lesson learned: always validate your data source before analyzing the data. And don't blindly trust Mr. AI, either...

Closing this thread as resolved.

