Netadmin2020
Collaborator

High Cpu utilization - Internet Speed drops to very low level

Dear Colleagues ,

 

I have a problem with high CPU utilization, and while the issue is occurring the internet speed drops dramatically.

Appliance Check Point 15600 - Hot fix: Check_Point_R80_30_JUMBO_HF_Bundle_T219_sk153152
 
I am attaching some photos from the commands executed by the time of issue.
Attachments: advanced.PNG, cpu.PNG, cpu2.PNG, cpu3.PNG, extra.PNG, fw mul util.PNG, health.PNG, io.PNG, memory.PNG, network1.PNG, stats.PNG, sysinfo.PNG, top protocols.PNG
 
FYI we are using Policy Based Routing & Remote Access VPN.
 

    


25 Replies
Magnus-Holmberg
Advisor

Are you using Multi-Queue on the interfaces? What does fw ctl affinity -l show?
Could you show enabled_blades as well?

Any messages in /var/log/messages?
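For convenience, the checks asked for above can be gathered in one pass from expert mode. This is a sketch of a command transcript, not a script — the output is only meaningful on the gateway itself, and the grep pattern is just one way to skim the log:

```shell
# Run on the gateway in expert mode.
fw ctl affinity -l                                  # CoreXL / SND core-to-interface assignment
enabled_blades                                      # which software blades are active
netstat -ni                                         # per-interface counters (watch the RX-DRP column)
grep -iE 'drop|err' /var/log/messages | tail -20    # recent drop/error lines, if any
```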

https://www.youtube.com/c/MagnusHolmberg-NetSec
Magnus-Holmberg
Advisor

Where is your traffic going in and out?

CPU0 is allocated to Eth3-02 and Eth2-06.
Based on your picture above, this CPU core maxes out.
What are those interfaces used for? If they carry traffic, I would activate Multi-Queue so that you can distribute the load over multiple cores.
https://sc1.checkpoint.com/documents/R80.30/WebAdminGuides/EN/CP_R80.30_PerformanceTuning_AdminGuide...

Regards,
Magnus

Timothy_Hall
Legend

That can't be a 15600; a 15600 has 8 physical cores, 16 presented with SMT enabled. Your screenshots are showing 32 cores.

It appears you have the default 4/28 split, and CPUs 0/16 and 1/17 are getting killed with 46% of traffic fully accelerated, while the Firewall Worker cores sit relatively idle. It is likely that your interfaces are experiencing high frame loss (RX-DRP) during the slow periods, which is visible with netstat -ni.
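To make the RX-DRP check concrete, the snippet below flags any interface whose RX-DRP counter is non-zero and shows it as a fraction of RX-OK. The sample data is fabricated purely to illustrate the column layout of netstat -ni on this kernel (Iface, MTU, Met, RX-OK, RX-ERR, RX-DRP, ...); on the gateway you would pipe the real command instead, and note that column positions can differ between net-tools versions:

```shell
# Fabricated sample shaped like `netstat -ni` output; real use would be:
#   netstat -ni | awk 'NR > 1 && $6 > 0 { ... }'
sample='Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth1-01 1500 0  91234567      0  48211      0  8123456      0      0      0 BMRU
eth1-02 1500 0   5123456      0      0      0  4123456      0      0      0 BMRU'

# $4 = RX-OK, $6 = RX-DRP when the Met column is present (older net-tools)
echo "$sample" | awk 'NR > 1 && $6 > 0 {
    printf "%s: %d drops (%.3f%% of RX-OK)\n", $1, $6, 100 * $6 / $4
}'
# -> eth1-01: 48211 drops (0.053% of RX-OK)
```

Even fractions of a percent here can matter, since every drop forces a TCP retransmit and shows up as the "slow internet" symptom described above.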

I would recommend decreasing the number of firewall instances from 28 to 24 via cpconfig, for a target 8/24 split. Once that is done and the system rebooted, enable Multi-Queue on your 10Gbps interfaces, which will require another separate reboot under Gaia kernel 2.6.18, which appears to be what you are using. Memory and disk look fine.
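A rough outline of that sequence on the CLI. cpconfig's CoreXL menu is interactive, so treat this as an ordered checklist rather than a script; cpmq is the Multi-Queue tool on the 2.6.18 kernel:

```shell
cpconfig              # -> "Check Point CoreXL" -> set firewall instances to 24
reboot                # the new 8/24 split takes effect on boot
cpmq get -a           # show Multi-Queue status of all interfaces
cpmq set              # enable Multi-Queue on the busy 10Gbps interfaces (interactive)
reboot                # the Multi-Queue change needs its own reboot
fw ctl affinity -l    # afterwards, verify the new SND/worker assignment
```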

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Netadmin2020
Collaborator

appliance.JPG

You mention that my interfaces are losing packets during slow periods. What exactly is the reason for this?

These are my settings in the performance section:

no opt.JPG

 

If I choose "optimize for packet rate and throughput", it recommends 6 cores for performance and 26 for instances.

perfomance 1.JPG

 

 

 

PhoneBoy
Admin

Personally, I’d upgrade to R80.40 or R81 and enable Dynamic Workloads.
All this stuff will then be tuned automatically on the fly.

Magnus-Holmberg
Advisor

In this case, reinstalling with R80.30 on the 3.10 kernel would have fixed it as well, but an upgrade to R80.40 may be easier 🙂

Timothy_Hall
Legend

The reason for the packet loss is insufficient CPU resources allocated to empty the 10Gbps interface ring buffers quickly enough via SoftIRQ (seen as RX-DRP), which is a responsibility of the SND/IRQ cores. Based on your traffic mix and the other information you provided, I'd recommend an 8/24 split with Multi-Queue enabled on your busy 10Gbps interfaces.

The 6/26 the performance slider is showing is a general recommendation; my 8/24 recommendation is specific to the info you provided. 6/26 might be OK too.

Chris_Atkinson
Employee

+1 for Timothy Hall's recommendation.

Note also the following regarding Remote Access VPN in case the volumes are significant.
sk165853: High CPU usage on one CPU core when the number of Remote Access users is high

Indeed, the number of cores shown is consistent with that of a 15600, per:
https://www.checkpoint.com/downloads/products/15600-security-gateway-datasheet.pdf 

Additionally, note that only the 2.6 kernel is supported on the 15600 prior to R80.40.

CCSM R77/R80/ELITE
Timothy_Hall
Legend

Ah, I wasn't accounting for the "2X" Intel Xeon E5-2630v3 (8 cores/16 threads); I thought it was just 1X. Thanks for the clarification, Chris.

Netadmin2020
Collaborator

Hello again, and thank you all for your detailed answers.

I have two nodes. When I first applied the settings to the standby node, enabling Multi-Queue on the specific interfaces with a 6/26 core split, after the reboot the cluster does not seem to be syncing.

Timothy_Hall
Legend

Correct, it will not sync until both members have an identical CoreXL split. You can blunt the impact of a non-stateful failover by temporarily unchecking "Drop out of state TCP packets" on the Stateful Inspection screen of Global Properties and reinstalling policy to both members. Don't forget to re-check it when finished!

PhoneBoy
Admin

Something that won’t happen when you use R80.40+ and dynamic workloads to manage the split, FYI. 

Netadmin2020
Collaborator

Just to be sure:

a) I switch the second node to Multi-Queue and change the core split

b) saving and reboot

c) On the first node I should change the “Drop out of state TCP packets”

d) install policy on both nodes

yes?

Timothy_Hall
Legend

a) uncheck drop out of state packets

b) publish and reinstall to both

c) change core split of second node

d) reboot it (you can't change core split and multi-queue at the same time, you must reboot after doing either one)

e) enable multi queue for relevant interface(s) on second node

f) reboot it

 

genisis__
Leader

Tim - I have another post related to this same topic, which you have already commented on, regarding my VSX setup (TAC are still dragging their heels on this).

I was wondering if anyone has actually deployed R81 in a VSX setup yet, and whether there are any reported issues?

I'm looking to upgrade my R80.20 VSX setup to R80.40 or R81. I would like to move to R81, but I think it may be a little too early for that.

Also, I know the recommendation is to rebuild, but in the current climate a remote upgrade is the preferred method, so I will likely do an inline upgrade. From what I can tell, the kernel version would get upgraded, Multi-Queue turned on, and other parameters enabled by default, such as the core load balancing parameter (sk168513).

Clearly the new filesystem would not get used, but I don't see this being hugely important on the gateway side (happy to be educated on this if I'm wrong).

 

Timothy_Hall
Legend

I personally have not worked with VSX under R81 yet, I'd suggest posting a new thread asking for community experiences where more people can see it.  The resident community VSX expert (especially when it comes to performance) is @Kaspars_Zibarts.

Netadmin2020
Collaborator

Great,

With the above steps, I suppose I won't have any interruptions?

Magnus-Holmberg
Advisor

As the sync is not working, there will be a small dip.
The connections will need to be re-established.

Maarten_Sjouw
Champion

Just don't forget to re-check "Drop out of state TCP packets" at least an hour after you have changed both members, and push policy again.

Regards, Maarten
Netadmin2020
Collaborator

Merry Christmas to all! 

I made a guide and I want to confirm some final details.

A) In Global Properties, uncheck "Drop out of state TCP packets" to blunt the impact of the failover
B) Install the policy on both cluster members
C) Change the core split on the second node
D) Reboot the second node
E) Enable Multi-Queue on the relevant interfaces (2nd node)
F) Reboot

After the above steps there will theoretically be no cluster sync, so I should reboot the active (first) node, and the second node with the new settings will take over?

thanx 

 

Maarten_Sjouw
Champion

Yes, that is correct, but you should check the cluster setting for the member that will be primary: make sure it is set to keep the active member, not to return to the highest-priority member.

Before rebooting the original active member, change its core assignment setting first; that way you take care of two things in one go.
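Between the reboots, the standard ClusterXL status commands are an easy sanity check before failing anything over (a sketch; run in expert mode on each member):

```shell
cphaprob stat       # member states: Active / Standby / Down
cphaprob -a if      # monitored cluster interfaces, including the sync interface
fw ctl pstat        # kernel statistics; the "Sync:" section shows sync activity
```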

Regards, Maarten
Netadmin2020
Collaborator

Final Guide:

A) In Global Properties, uncheck "Drop out of state TCP packets" to blunt the impact of the failover
B) Install the policy on both cluster members
C) Change the core split on the second node
D) Reboot the second node
E) Enable Multi-Queue on the relevant interfaces (2nd node)
F) Reboot

The second node is now ready with the new settings, so we restart the active node and the second node will take over.
After the reboot, when everything works appropriately, we must re-check "Drop out of state TCP packets" and install policy on both members.

Netadmin2020
Collaborator

Hello guys,

I finalized the whole procedure two days ago with success. I am monitoring the gateway, and I will know for sure next week. Until now, everything seems good.
