Hi all,
I have a few questions about an R80.10 VSX performance issue. We have recently had a series of different kinds of performance issues.
Our environment has multiple virtual systems. VS2 and VS1 are connected through a Virtual Switch (Perimeter Firewall - CKP VS2 - CKP VS1 - Internal network), while VS4 acts as a site-to-site VPN gateway.
The most recent issue was high CPU utilization on a single CPU core, which reached 99%.
The zdebug output shows quite a few dropped packets, with two typical messages:
"...dropped by fwkdrv_enqueue_packet_user_ex Reason: Instance is currently fully utilized;"
"...dropped by cphwd_pslglue_handle_packet_cb Reason: F2P: Instance is currently fully utilized;"
During the incident, ICMP traffic passing through the firewall saw high latency of around 1000 ms, but we did not get any ICMP timeouts.
We could not capture any further evidence beyond the high CPU utilization on one specific kernel instance, fwk2_2.
After the issue cleared on its own, we discussed it with a TAC engineer, who advised us to change one CoreXL kernel parameter, "fwmultik_input_queue_len" (from the VSX default of 2048 to 8196).
Since then we have had no further issues, but the root cause analysis is still ongoing.
Our questions:
1) Why were the packets dropped, and why did ICMP show only high latency rather than timeouts, given the packet drops?
2) What is "fwmultik_input_queue_len"? What is the input queue used for, and where exactly does it sit in the packet flow? How can we check the currently configured input queue length and its utilization, and set up active alerting based on that?
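To make question 1 concrete, here is a toy model of how I currently picture the input queue. It is only my assumption (a fixed-length FIFO that drops new arrivals when full), not Check Point's actual implementation, and the function name, parameters, and numbers are all hypothetical. If the picture is right, packets that find room in the queue are delayed rather than lost, while packets arriving at a full queue are dropped.

```python
from collections import deque


def simulate(queue_len, arrivals_per_tick, served_per_tick, ticks):
    """Toy tail-drop FIFO (hypothetical model, not the real CoreXL queue).

    Returns (dropped, max_wait_ticks) where max_wait_ticks is the longest
    time any served packet spent waiting in the queue.
    """
    queue = deque()  # each entry is the tick at which the packet arrived
    dropped = 0
    max_wait = 0
    for tick in range(ticks):
        # New packets are queued if there is room, otherwise dropped.
        for _ in range(arrivals_per_tick):
            if len(queue) < queue_len:
                queue.append(tick)
            else:
                dropped += 1
        # The busy instance drains only a limited number of packets per tick.
        for _ in range(min(served_per_tick, len(queue))):
            max_wait = max(max_wait, tick - queue.popleft())
    return dropped, max_wait


if __name__ == "__main__":
    # Same overload, two queue lengths (made-up numbers for illustration).
    print("short queue:", simulate(4, 3, 1, 10))
    print("long queue: ", simulate(16, 3, 1, 10))
```

In this sketch a longer queue drops fewer packets during a short burst but lets the surviving packets wait longer, which is how I would explain high ICMP latency without timeouts. Please correct me if the real queue behaves differently.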
Regards,
David