Re: R80.10 VSX performance issue - high CPU utiliz...

David_Guan · ‎2019-02-25

Hi all,

I have a few questions about an R80.10 VSX performance issue. We have recently had a series of different kinds of performance issues.

In our environment we have multiple virtual systems. VS2 and VS1 are connected through a Virtual Switch (Perimeter Firewall - CKP VS2 - CKP VS1 - Internal network), while VS4 acts as a site-to-site VPN gateway.

The most recent issue was high CPU utilization on a single CPU core, which reached 99%.

In the zdebug output we can see quite a few packets dropped, with these two typical messages:

"...dropped by fwkdrv_enqueue_packet_user_ex Reason: Instance is currently fully utilized;"

"...dropped by cphwd_pslglue_handle_packet_cb Reason: F2P: Instance is currently fully utilized;"

During the incident we had ICMP traffic passing through the firewall and saw high latency of around 1000 ms, but we did not get any ICMP timeouts.

We failed to capture more evidence apart from the high CPU utilization on one specific kernel instance, fwk2_2.

After the issue went away by itself, we discussed it with a TAC engineer, who advised us to change one specific CoreXL kernel parameter, "fwmultik_input_queue_len" (from the VSX default of 2048 to 8196).

Since then we have not had any further issues, but the root cause analysis is still ongoing.

My questions are:

1) Why were the packets dropped, and why did ICMP only see high latency rather than timeouts caused by the packet drops?

2) What is "fwmultik_input_queue_len"? What is the input queue used for, and where exactly does it sit in the packet flow? How can we check the current input queue length setting and its utilization, and set up an active alert based on that?

Regards,

David

Danny · ‎2019-02-25

Thanks to Kaspars Zibarts' great presentation at CPX360 Vienna, you could start by verifying that your virtual systems are all running in 64-bit mode (in admin mode: show version os edition; in expert mode: vs_bits -stat). Next, check your connections table capacity, the max. concurrent connections setting (15000 default?), fw ctl pstat, aggressive aging logs, multi-queue and hyperthreading demands, cpmq get -vv, CoreXL sharing settings and core mapping.

David_Guan · ‎2019-02-25

Hi Danny,

Thanks for sharing your insight. We did already hit the max. concurrent connections limit with the default value of 15,000 and recently increased it to 500,000 for each VS.

We are still going through the current CoreXL and SecureXL affinity setup to find out what needs to be optimised.

For the root cause analysis, I read through a few articles about the packet flow within Check Point, but I would like to understand in more depth how the input queue works. Is it the driver queue? A buffer in the IP stack? Or something else? How many queues are there? How do these queues map to the CPU cores? Could increasing the queue length cause more latency?

TAC did confirm that ICMP traffic is always handled by a FW kernel instance. Then why did we not have any ICMP packets dropped? Are they handled differently from TCP/UDP traffic?

Sorry for bringing up so many questions; I could not find a KB article that explains the above in depth.

Could anyone please explain this to me?

Thanks!

David

Kaspars_Zibarts · ‎2019-02-26

Hi David

Here's a link to my presentation that might help, but it might be hard to understand without the audio:

VSX performance optimisation.pdf

You would have to share a bit more detail about your VSX config for us to help you further.

In particular, a great start would be

fw ctl affinity -l

and some statistics from both VS1 and VS2: max throughput, max concurrent connections and traffic profile (how much is accelerated and how much takes the medium or slow path).

David_Guan · ‎2019-03-01

Hi Kaspars,

Thanks for the presentation PDF.

It is very disappointing that TAC could not explain further what fwmultik_input_queue_len relates to and why we were getting both packet drops on TCP/UDP and high latency on ICMP traffic. We will ask Check Point to escalate further and try to talk to the right person who knows this kernel parameter in depth. Is there anywhere I can find documentation on the kernel global parameters, especially for CoreXL?

Here is the output of a few commands to validate the core affinity and interface affinity settings, as well as the SecureXL stats.

[Expert@CKP001:1]# fwaccel stats -s
Accelerated conns/Total conns : 384/13651 (2%)
Accelerated pkts/Total pkts : 46129348/650845659 (7%)
F2Fed pkts/Total pkts : 205977595/650845659 (31%)
PXL pkts/Total pkts : 398738716/650845659 (61%)
QXL pkts/Total pkts : 0/650845659 (0%)

[Expert@CKP001:2]# fwaccel stats -s
Accelerated conns/Total conns : 160/15762 (1%)
Accelerated pkts/Total pkts : 28265482/638543714 (4%)
F2Fed pkts/Total pkts : 203529348/638543714 (31%)
PXL pkts/Total pkts : 406748884/638543714 (63%)
QXL pkts/Total pkts : 0/638543714 (0%)

[Expert@CKP001:4]# fwaccel stats -s
Accelerated conns/Total conns : 2910/3159 (92%)
Accelerated pkts/Total pkts : 61579081/216347018 (28%)
F2Fed pkts/Total pkts : 7730087/216347018 (3%)
PXL pkts/Total pkts : 147037850/216347018 (67%)
QXL pkts/Total pkts : 0/216347018 (0%)

[Expert@CKP001:0]# fw ctl affinity -l -a -v
Interface eth2-03 (irq 92): CPU 0
Interface Mgmt (irq 124): CPU 14
Interface Sync (irq 140): CPU 11
VS_0: CPU 2 3 4 5 6 7
VS_0 fwk: CPU 2 3 4 5 6 7
VS_1: CPU 2 3 4 5 6 7
VS_1 fwk: CPU 2 3 4 5 6 7
VS_2: CPU 2 3 4 5 6 7
VS_2 fwk: CPU 2 3 4 5 6 7
VS_3: CPU 2 3 4 5 6 7
VS_3 fwk: CPU 2 3 4 5 6 7
VS_4: CPU 2 3 4 5 6 7
VS_4 fwk: CPU 2 3 4 5 6 7

[Expert@CKP002:0]# sim affinity -l -r -v
Mgmt : 14
Sync : 11
eth1-01 : 1
eth1-03 : 9
eth1-04 : 12
eth2-01 : 8
eth2-03 : 0
eth2-04 : 15
eth3-04 : 0

[Expert@CKP002:0]# fw ctl multik stat
ID | Active | CPU | Connections | Peak
----------------------------------------------
0 | Yes | 2-7 | 19 | 1018
1 | Yes | 2-7 | 116 | 508

[Expert@CKP002:1]# fw ctl multik stat
ID | Active | CPU | Connections | Peak
----------------------------------------------
0 | Yes | 2-7 | 6021 | 25584
1 | Yes | 2-7 | 4330 | 23168
2 | Yes | 2-7 | 3766 | 12999

[Expert@CKP002:2]# fw ctl multik stat
ID | Active | CPU | Connections | Peak
----------------------------------------------
0 | Yes | 2-7 | 6733 | 15176
1 | Yes | 2-7 | 4697 | 16475
2 | Yes | 2-7 | 3990 | 12751

[Expert@CKP002:4]# fw ctl multik stat
ID | Active | CPU | Connections | Peak
----------------------------------------------
0 | Yes | 2-7 | 28792 | 43845
1 | Yes | 2-7 | 29796 | 46870

ddxrrx · ‎2019-03-26

Hi Kaspars,

Is there anywhere that I could view your presentation that goes along with the attached pdf?

PhoneBoy · ‎2019-03-26

As this was a CPX360 presentation, we only uploaded the PDF.

@Kaspars_Zibarts if you're ok with it, I can replace the link with the PPT version.

Kaspars_Zibarts · ‎2019-03-28

Sorry guys, I've been snowed under with work, so I have ignored CheckMates for the last couple of weeks 😞 Yes, absolutely, share the PPT. I checked the comments; they are not perfect but should help.

David_Guan · ‎2019-03-13

An update on this:

We are still waiting for Check Point to explain in depth what fwmultik_input_queue_len is and why ICMP traffic only saw high latency, with no packet drops at all, while TCP/UDP traffic was dropped during the incident.

Having read the two KB articles below, one on Priority Queues and the other on Dynamic Dispatcher, I believe they are related to the performance issue we had. The queue length is also mentioned in the article on Priority Queues.

https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut...

Unfortunately, neither feature is supported in our VSX environment with R80.10. We have to wait for R80.30 to be stable and then plan for the N-1 upgrade to R80.20 so that we can deploy Dynamic Dispatcher and limited Priority Queue (static priority mode only).

Wolfgang · ‎2019-03-13

David,

At the end of last year we had a similar issue, "dropped by fwkdrv_enqueue_packet_user_ex Reason: Instance is currently fully utilized", resulting in a complete outage of one of our VSX cluster nodes. Moving the VS to another node solved the problem for some hours, but it occurred again.

After some very long debug sessions with TAC we found the root cause and were able to solve the problem. We changed the value of fwmultik_input_queue_len to 8196. This scenario is described in sk61143, which gives some explanation of the value:

Queue utilization depends on the packet rate rather than on the amount of connections. The loaded instance's CPU, from time to time reaches 100%, so it handles packets slower than the packets are received.
If the load is not persistent (comes and goes), enlarging the queue should resolve the issue, as it allows more packets to "wait" in the queue, for processing of the CoreXL FW instance. If the high load is persistent, expect to see the same issue again, after a while. This needs to be handled as part of performance tuning
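
The trade-off described in the quoted text can be illustrated with a toy bounded-queue model. This is a sketch under stated assumptions, not Check Point's implementation: the function `simulate`, the tick-based timing, the burst sizes and the service rate are all made up for illustration. Only the two queue lengths (2048 and 8196) come from this thread.

```python
from collections import deque


def simulate(queue_len, arrivals, service_per_tick):
    """Toy discrete-time FIFO with a hard length limit.

    arrivals[i] is the number of packets arriving in tick i.
    Returns (dropped_packets, max_wait_in_ticks).
    """
    queue = deque()  # holds the arrival tick of each queued packet
    dropped = 0
    max_wait = 0
    for tick, count in enumerate(arrivals):
        # Enqueue arrivals; anything beyond the limit is dropped.
        for _ in range(count):
            if len(queue) < queue_len:
                queue.append(tick)
            else:
                dropped += 1
        # The instance drains at most service_per_tick packets per tick.
        for _ in range(min(service_per_tick, len(queue))):
            waited = tick - queue.popleft()
            max_wait = max(max_wait, waited)
    return dropped, max_wait


# Hypothetical load: a short burst followed by light traffic.
BURST = [5000, 5000] + [200] * 18
SERVICE_RATE = 1000  # packets the instance can handle per tick (assumed)

if __name__ == "__main__":
    for length in (2048, 8196):
        drops, wait = simulate(length, BURST, SERVICE_RATE)
        print(f"queue_len={length}: dropped={drops}, max wait={wait} ticks")
```

In this model the longer queue drops fewer packets during the burst, but the packets that do get through wait longer, which is one way to think about the earlier question of whether a longer queue can add latency. If the arrival rate stays above the service rate, any finite queue fills up eventually, which matches the quoted remark about persistent load.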

Our main problem was a large increase in Microsoft traffic (CIFS, ALL_DCE_RPC etc.). Sometimes we got 1-1.5 Gb/s of CIFS traffic over a VPN tunnel, which drove some CPUs to 100%.

Changing the mentioned value and adding some more CPUs to the VS solved all the problems.

Wolfgang

Kaspars_Zibarts · ‎2019-03-13

Great info! Reading between the lines, I guess the problem is a single elephant flow that overloads one specific core. And since Dynamic Dispatcher is not supported in VSX R80.10, adding new cores won't help 😞
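
Why more cores do not help with an elephant flow can be sketched with a static, hash-based dispatcher. The hash below (CRC32 over source and destination address) and the names `pick_instance` and `load_per_instance` are assumptions chosen for illustration; the actual CoreXL distribution function is not described in this thread. The addresses and packet counts are made up.

```python
import zlib
from collections import Counter


def pick_instance(src_ip, dst_ip, num_instances):
    """Statically map a flow to an instance; the same flow always lands on the same one."""
    key = f"{src_ip}>{dst_ip}".encode()
    return zlib.crc32(key) % num_instances


def load_per_instance(flows, num_instances):
    """Sum packets per instance for (src, dst, packets) tuples."""
    load = Counter()
    for src, dst, packets in flows:
        load[pick_instance(src, dst, num_instances)] += packets
    return load


# One hypothetical elephant flow plus fifty small flows.
ELEPHANT = ("10.0.0.10", "10.1.0.20", 1_000_000)
MICE = [(f"10.0.1.{i}", "10.1.0.30", 1_000) for i in range(1, 51)]
FLOWS = [ELEPHANT] + MICE

if __name__ == "__main__":
    for instances in (2, 3, 6):
        load = load_per_instance(FLOWS, instances)
        print(f"{instances} instances: busiest handles {max(load.values())} packets")
```

However many instances are configured, the busiest one still carries the whole elephant flow, because a static mapping never splits a single flow and does not take the current load of an instance into account.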

Jerry · ‎2019-03-14

Sorry to jump in, but I did not notice it mentioned anywhere in the thread: did you already try proper SIM affinity settings?

Jerry

R80.10 VSX performance issue - high CPU utilization on specific core