We have a 2‑node Check Point cluster and are monitoring the SNMP OID fwFullyUtilizedDrops.
Outside business hours we see ~1000 drops/sec, even though overall CPU usage is low and no process shows high CPU. During business hours, CPU becomes heavily loaded and the drops increase to 5000+/sec.
So far we have checked:
Where else can these fwFullyUtilizedDrops originate?
Is there any way to capture or trace them (tcpdump‑style, zdebug, or other) to determine where the drops are occurring?
It took me a while to go through your Super Seven outputs, and you have several things going on:
1) What code level is this and on what hardware? It looks like Hyperflow is enabled, so I would assume at least R81.20?
2) You are in KPPAK mode. Did you explicitly disable UPPAK? If these gateways were upgraded to R81.20 or R82, the KPPAK mode will remain, whereas a fresh install will enable UPPAK by default, assuming it is a Lightspeed or Quantum Force appliance.
3) My understanding of what "instance is fully utilized" means is that traffic was trying to be sent from an SND to a Firewall Worker for Medium/Slowpath handling, but the incoming CoreXL queue was full, and the traffic was lost. (sk61143: Traffic is dropped by CoreXL with "fwmultik_inbound_packet_from_dispatcher Reason: Instance...). This is considered a drop by the SND, and will not be displayed by fw ctl zdebug drop, you must use fw ctl zdebug + drop to see these. The SK discusses increasing the CoreXL queue size; I do not recommend doing so until a better understanding of the situation has been achieved.
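To actually watch for these on the gateway, here is a rough sketch of the capture (expert mode; the grep pattern is an assumption based on the message text quoted in sk61143, so adjust it to whatever your debug output actually shows):

```
# "fw ctl zdebug drop" alone will NOT show these SND-side drops;
# the "+ drop" form loads the extra kernel debug flags needed to see
# the dispatcher/instance-full messages (per sk61143).
fw ctl zdebug + drop | grep -iE "fully utilized|fwmultik"

# When finished, make sure the kernel debug flags are reset:
fw ctl debug 0
```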
4) There is a high level of Hyperflow boosting going on, which would suggest the presence of many elephant flows. fw ctl multik print_heavy_conns will show all current elephant flows, as well as all those detected within the past 24 hours. For elephant flows longer ago, check the log files of the Spike Detective: sk166454: CPU Spike Detective
5) How are you using the QoS blade? You have a very large amount of traffic on the QoS paths. Limits? Guarantees? Shaping? Using this capability may deliberately delay traffic in queues, resulting in full queue drops. QoS is a very old feature, and Hyperflow is very new; I've never heard of anyone using them together, and how they might interact under high load could be quite unpredictable.
6) Interface eth4-04 appears to be overloaded and is losing around 1% of its incoming frames; interface eth2-01 is also losing 0.21% which are both above the 0.1% rule of thumb. This usually means there are not enough SND cores, but you seem to have plenty (20) which may indicate Multi-Queue issues or bumping up against queue limitations for certain interfaces/drivers. But not all RX-DRPs are legitimate ring buffer misses resulting in frame loss, so please post the output of the following expert mode commands for further analysis:
ethtool -i eth4-04
ethtool -S eth4-04
ethtool -i eth2-01
ethtool -S eth2-01
mq_mng -o -v
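As a quick way to apply the 0.1% rule of thumb, the drop ratio can be computed from the RX-OK/RX-DRP columns of netstat -ni. A minimal sketch; the counter values below are illustrative round numbers matching the percentages in this thread, not the real counters from this gateway:

```shell
# Flag interfaces whose RX-DRP ratio exceeds the 0.1% rule of thumb.
# Usage: check_rx_drp <iface> <rx_ok> <rx_drp>
check_rx_drp() {
    awk -v i="$1" -v ok="$2" -v drp="$3" 'BEGIN {
        pct = 100 * drp / (ok + drp)
        verdict = (pct > 0.1) ? "investigate" : "ok"
        printf "%s: %.2f%% RX-DRP -> %s\n", i, pct, verdict
    }'
}

check_rx_drp eth4-04 990000 10000   # ~1% loss, over threshold
check_rx_drp eth2-01 997900 2100    # ~0.21%, still over threshold
```

Anything consistently above 0.1% is worth the ethtool -S deep dive requested above, since not every RX-DRP is a genuine ring buffer miss.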
7) Would be nice to fix your templating issues shown by fwaccel stat (you're still getting a 4% template cache hit rate even with these issues), as excessive rule base lookups due to accept template cache misses can drive up the CPU on the workers and keep them from emptying their incoming queues fast enough. Your level of F2F/slowpath traffic is fine.
I'll need to see some more data, but at first glance my gut feel is that disabling either QoS or Hyperflow may help. One of these is probably trying to slow things down (shape) and the other is trying to make things go much faster. I'm very unsure about how these two are going to affect each other.
A 26000 model is not a Quantum Force, so it can't use UPPAK until you upgrade to R82.10 where it is mandatory for all models. Ironically, by using poll mode instead of interrupt mode on the SNDs to empty ring buffers, UPPAK will solve the exact problem you are having.
Multi-Queue appears to be doing a good job of balancing traffic across the SNDs for eth4-04 and eth2-01. However because these interfaces are using the i40e driver (which supports up to 64 total queues), and you have a total of 5 active interfaces using the i40e driver (eth2-01, eth3-01, eth3-04, eth4-01, eth4-04) each interface can only have 12 total queues, which means only 12 SNDs can empty each interface's ring buffer, no matter how many SNDs there are. This is probably why eth4-04 is falling behind with the 1% loss via RX-DRP. I would definitely recommend an 802.3ad LACP Active/Active bond here if possible, with the transmit hash policy set to L3+4 on both sides. This bond will create a grand total of 20 queues, allowing up to 20 SNDs to help with this heavy load and keep it from falling behind, as well as increased bandwidth for heavy bursts of traffic.
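If you go the bond route, the Gaia clish configuration would look roughly like this. Treat it as a sketch: the group number and the second member port (eth4-03) are placeholders, member interfaces must have no IP address assigned before being added, and the 802.3ad mode plus L3+4 hash must be configured to match on the switch side:

```
add bonding group 1
add bonding group 1 interface eth4-04
add bonding group 1 interface eth4-03
set bonding group 1 mode 8023AD
set bonding group 1 xmit-hash-policy layer3+4
save config
```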
I'm working on a new book right now that will be fully updated and cover situations like this. For the templating issues, here is the content from the new book documenting the current conditions that cause accept templating to stop:
SecureXL Session Rate Acceleration (Accept Templates)
A rulebase lookup is one of the most computationally expensive operations that a firewall has to perform, although the advent of Column-based Matching described in the last section has reduced the overhead substantially. However, the ability of SecureXL to form Accept Templates (essentially a "cached" rulebase lookup) for repeated, substantially similar connections predates Column-based Matching by many years, but it no longer provides the same level of performance boost it did then.
To check the state of SecureXL Accept Templating, run fwaccel stat. Once an Accept template is created, substantially similar connections for the next 3 minutes can "hit" on the cached entry. Each hit resets the three-minute timer. As long as you see "Accept Templates: enabled" you are good to go, and the entire Firewall/Network policy layer is eligible to have Accept templates created for it (usually...unless the indicated templating match rate always seems to be zero...more on that bit later!):
But what if you see something like this:
Take a look at this thread; it could be useful for you:
While the majority of packets are accelerated, the vast majority of connections are not.
The output notes several rules that are disabling templates.
They should be reviewed, as this might be contributing to the CPU utilization issues.
Thank you both for your answers.
This morning, we disabled QoS, and the CPU load stabilized after this modification.
However, based on all our research and the additional information you provided, we need to take our optimization efforts to the next level.
To answer your question properly:
We have two 26000 appliances in a ClusterXL configuration running R81.20 Take 120.
Honestly, I had never heard of KPPAK and UPPAK before, so I will read up on them.
We initially deployed our environment on R81.10 three years ago, so based on what you said, we should be using KPPAK.
The QoS blade was activated last summer to guarantee bandwidth for our telephony environment, which recently transitioned to a full SIP trunk, ensuring a minimum level of service.
I have attached the output of the five commands you suggested running.
Finally, I will work on our policy to improve the conditions for the acceleration template to function more effectively.
Thank you again for your help.
P.S. We bought your book last week.
Thank you for your suggestions.
We would like to provide a quick update on our side. Like I already said, disabling QoS immediately brought the CPU back to a normal level, which confirms that QoS was contributing to the high CPU usage. However, the fwFullyUtilizedDrops counter only decreased by about half, and we still consider the remaining value abnormally high. This point remains the main concern of our original question.
In addition, we are still seeing an unusually high number of “first packet isn’t SYN” events (about 80 drops/sec). This is a long‑standing issue in our environment, and it could be related to the behavior we are observing with the drops.
At the moment, fwFullyUtilizedDrops are around 2000 drops/sec, while the drops reported by fw ctl zdebug are approximately 500 drops/sec. Regardless of whether we run fw ctl zdebug drop or fw ctl zdebug + drop, we do not see the fwFullyUtilizedDrops reflected in the zdebug output.
We will proceed with implementing all the recommendations you suggested (or as much as possible) and will report back in this forum once everything is in place and we have results to share.
Thank you again for your help and time.
Best regards.
Hello,
We have corrected the accept templates configuration. It is now set to Enabled:
Result: the number of fwFullyUtilizedDrops did not decrease. We are still observing approximately 2000–2500 drops/sec during peak production hours.
Before moving forward with interface bonding (802.3ad LACP Active/Active), we wanted to ensure that there was a real correlation between the number of NIC queues and the fwFullyUtilizedDrops counter.
To validate this, we deliberately reduced the number of queues by half on our two highest packet-rate interfaces.
Before the configuration change, we had the following setup:
After the change, the configuration was:
Result: the number of fwFullyUtilizedDrops did not increase, contrary to what we would have expected if the number of queues were directly correlated with this counter.
Based on this outcome, we currently conclude that implementing bonding (802.3ad LACP Active/Active) to increase the number of queues should not reduce the observed drops — unless we are missing an important aspect of how this counter works internally.
At this time, the only action that has resulted in a noticeable reduction of fwFullyUtilizedDrops was disabling the QoS blade.
For additional context:
- cpview (Overview section) reports approximately 350,000 packets/sec
- fwFullyUtilizedDrops is around 2500/sec, which represents close to 1% packet loss

This leads us to question our understanding of the fwFullyUtilizedDrops counter itself.
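As a sanity check on the arithmetic, 2500 drops/sec against 350,000 packets/sec works out to roughly 0.71%, so "close to 1%" is in the right ballpark:

```shell
# Estimated loss ratio from the cpview figures quoted above.
awk 'BEGIN { printf "%.2f%%\n", 100 * 2500 / 350000 }'
```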
Are these actual packet drops, even though there is no clear way to identify where in the datapath the drops are occurring?
We are planning to upgrade to R82 in the coming weeks, and we will observe whether this improves the situation.
Thank you for your time and for any additional insight you may be able to provide.
Best regards,
After extensive troubleshooting, we finally identified the root cause — and it turned out to be a monitoring configuration error on our end.
We had assigned OID iso.3.6.1.4.1.2620.1.1.25.13.0 (fwLoggedTotal) to our Zabbix item instead of iso.3.6.1.4.1.2620.1.1.25.26.0 (fwFullyUtilizedDrops). We were never actually observing fwFullyUtilizedDrops at all — we were monitoring the total number of logged connections.
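A small guard like the following could have caught the mixup before the Zabbix item went live. This is only a sketch: the lookup table contains just the two OIDs discussed in this thread, so always verify symbolic names against the official Check Point MIB file:

```shell
# Resolve a numeric Check Point FW MIB OID to its symbolic name
# before wiring it into a monitoring item.
oid_name() {
    case "${1#.}" in
        1.3.6.1.4.1.2620.1.1.25.13.0) echo "fwLoggedTotal" ;;
        1.3.6.1.4.1.2620.1.1.25.26.0) echo "fwFullyUtilizedDrops" ;;
        *) echo "UNKNOWN - check the MIB" ;;
    esac
}

oid_name .1.3.6.1.4.1.2620.1.1.25.13.0   # prints fwLoggedTotal
oid_name .1.3.6.1.4.1.2620.1.1.25.26.0   # prints fwFullyUtilizedDrops
```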
In hindsight, the most obvious clue was right there from the beginning: the drops were completely invisible in zdebug drop and in SmartConsole logs. That inconsistency should have immediately pointed to a measurement problem rather than an actual performance issue.
We want to thank Timothy Hall and Bob Zimmerman for their thorough and technically sound analysis of our CoreXL/SMT configuration. Even though it was based on a false premise, the findings regarding SND/Worker core overlap on our dual-socket setup are real and worth addressing independently.
Lesson learned: always validate your data source before analyzing the data. And don't blindly trust Mr. AI...
Closing this thread as resolved.