Hello to all,
This is my first post to this community; it's a privilege to be here, and I'm confident the high level of technical expertise around will be of great assistance.
I recently updated our R80.30 cluster to JHF Take 219, both because several private fixes we had were rolled into it and to improve overall performance. The cluster consists of 2x 23900 appliances in Active/Standby, with multiple 10G interfaces, Multi-Queue enabled (plus some static affinity configured) and Hyper-Threading disabled.
This translates to 6 CPUs for SND and 30 for CoreXL firewall workers. Of the 6 SND cores, 4 are allocated to Multi-Queue and 2 handle the remaining physical interfaces:
[Expert@WALL1.2:0]# cpmq get
Active ixgbe interfaces:
eth1-01 [On]
eth1-02 [Off]
eth4-01 [On]
eth4-02 [Off]
Active igb interfaces:
Mgmt [Off]
Sync [Off]
eth2-01 [Off]
eth2-02 [Off]
eth2-03 [On]
eth2-04 [On]
[Expert@WALL1.2:0]# fw ctl affinity -l -r | grep -e CPU\ 4 -e CPU\ 5
CPU 4: eth4-02 Sync eth2-01
CPU 5: Mgmt eth1-02 eth2-02
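For reference, the split above can be cross-checked with the standard CoreXL views (nothing box-specific, listed just for completeness):

fw ctl multik stat     # one line per fw_worker instance: its CPU, current connections and peak
fw ctl affinity -l     # affinity per interface and per fw kernel instance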
We had also configured 4 RX queues for Multi-Queue. Since yesterday morning, however, we have been seeing heavy utilization on CPU cores 4 and 5, which in the end impacted all traffic traversing the interfaces pinned to them. After a lot of troubleshooting, together with CP Support, what we ended up doing was changing the interfaces participating in Multi-Queue (since there is a limitation of only 5 interfaces on which it can be enabled), reducing CoreXL from 30 to 28 instances, and statically assigning affinity for eth1-01 and eth4-01 to the newly freed SND cores (roughly the steps sketched after the outputs below):
[Expert@WALL1.1:0]# sim affinity -l
eth4-01 : 6
eth2-01 : 4
eth1-01 : 7
eth2-02 : 5
Mgmt : 5
Sync : 4
Multi queue interfaces: eth1-02 eth4-02 eth2-03 eth2-04
[Expert@WALL1.1:0]# fw ctl affinity -l -r
CPU 0:
CPU 1:
CPU 2:
CPU 3:
CPU 4: Sync eth2-01
CPU 5: Mgmt eth2-02
CPU 6: eth4-01
CPU 7: eth1-01
[Expert@WALL1.1:0]# cpmq get
Active ixgbe interfaces:
eth1-01 [Off]
eth1-02 [On]
eth4-01 [Off]
eth4-02 [On]
Active igb interfaces:
Mgmt [Off]
Sync [Off]
eth2-01 [Off]
eth2-02 [Off]
eth2-03 [On]
eth2-04 [On]
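For completeness, this is roughly the sequence we went through to reach the layout above. cpmq set, cpconfig and sim affinity -s are interactive dialogs rather than one-liners, so treat the lines below as a sketch of the steps, not exact syntax (the cpmq queue-count syntax is from memory of sk98348, please verify):

cpmq set                   # choose which ixgbe/igb interfaces have Multi-Queue enabled (max 5 interfaces)
cpmq set rx_num ixgbe 4    # keep 4 RX queues per Multi-Queue interface
cpconfig                   # CoreXL menu: reduce firewall instances from 30 to 28 (reboot required)
sim affinity -s            # statically pin eth1-01 and eth4-01 to the freed cores 6 and 7
sim affinity -l            # verify, as shown above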
Now, even though performance is much better, the SND cores are still fairly busy, and we have not yet reached the day's peak, e.g.:
[Expert@WALL1.1:0]# cpstat -f multi_cpu os
Processors load
---------------------------------------------------------------------------------
|CPU#|User Time(%)|System Time(%)|Idle Time(%)|Usage(%)|Run queue|Interrupts/sec|
---------------------------------------------------------------------------------
| 1| 0| 38| 62| 38| ?| 87408|
| 2| 0| 52| 48| 52| ?| 87409|
| 3| 0| 35| 65| 35| ?| 87409|
| 4| 0| 31| 69| 31| ?| 87410|
| 5| 0| 26| 74| 26| ?| 87411|
| 6| 0| 42| 58| 42| ?| 87411|
| 7| 0| 17| 83| 17| ?| 87412|
| 8| 0| 47| 53| 47| ?| 87413|
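To see where that time actually goes on the SND cores, I have been watching them live with the usual tools; listing them here mainly so nothing obvious is missed (on the SND cores the load typically shows up as softirq rather than user space):

top        # press 1 for per-CPU lines; on the SND cores look at the si (softirq) column
cpview     # live per-CPU, per-interface and acceleration counters in one place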
Also, SecureXL seems to do a decent job:
[Expert@WALL1.1:0]# fwaccel stats -s
Accelerated conns/Total conns : 5212/40836 (12%)
Accelerated pkts/Total pkts : 466598460/846769916 (55%)
F2Fed pkts/Total pkts : 87904870/846769916 (10%)
F2V pkts/Total pkts : 4047817/846769916 (0%)
CPASXL pkts/Total pkts : 0/846769916 (0%)
PSLXL pkts/Total pkts : 292266586/846769916 (34%)
QOS inbound pkts/Total pkts : 0/846769916 (0%)
QOS outbound pkts/Total pkts : 0/846769916 (0%)
Corrected pkts/Total pkts : 0/846769916 (0%)
[Expert@WALL1.1:0]# fwaccel stat
+-----------------------------------------------------------------------------+
|Id|Name |Status |Interfaces |Features |
+-----------------------------------------------------------------------------+
|0 |SND |enabled |Mgmt,Sync,eth2-01, |
| | | |eth2-02,eth2-03,eth2-04, |
| | | |eth4-01,eth4-02,eth1-01, |
| | | |eth1-02,pimreg0 |Acceleration,Cryptography |
| | | | |Crypto: Tunnel,UDPEncap,MD5, |
| | | | |SHA1,NULL,3DES,DES,CAST, |
| | | | |CAST-40,AES-128,AES-256,ESP, |
| | | | |LinkSelection,DynamicVPN, |
| | | | |NatTraversal,AES-XCBC,SHA256 |
+-----------------------------------------------------------------------------+
Accept Templates : enabled
Drop Templates : enabled
NAT Templates : enabled
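For the acceleration picture itself, the usual breakdowns I know of (beyond stats -s above) are these; I am listing them in case someone has a smarter way to correlate the F2F/PSLXL share with actual connections:

fwaccel stats -p             # F2F (slow path) packets broken down by violation reason
fw tab -t connections -s     # size and peak of the connections table, to spot connection-count spikes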
So, after all the above, my question is: since this looks like a connection-handling issue, how do I find out what is causing all this CPU utilization? How can I identify which traffic could be causing it? During the high load I ran a tcpdump and a cpmonitor analysis to try to identify traffic that is new (?) or could break acceleration, but to no avail. I have also added some fast_accel rules for backup traffic and CIFS (CIFS was already in place), but could not measure any performance gain from them. And yes, this gateway cluster is indeed terminating VPNs, and I have already configured cphwd_medium_path_qid_by_mspi=0, which really skyrocketed VPN performance.
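In case it matters for the answer, this is roughly what those last items looked like on our side: the capture analysis, the fast_accel rules (per sk156672; argument order quoted from memory, so please verify against the SK) and the kernel parameter, set both on the fly and persistently. IPs, interface and file names below are examples only, not our real values:

tcpdump -nni eth1-01 -w /var/log/spike.pcap             # capture taken during the high load
cpmonitor /var/log/spike.pcap                           # offline report: top talkers, services, connection rates
fw ctl fast_accel enable
fw ctl fast_accel add 10.10.10.0/24 10.20.20.5 445 6    # example rule: src, dst, dport, IP proto (6 = TCP)
fw ctl fast_accel show_table                            # list the currently active fast_accel rules
fw ctl set int cphwd_medium_path_qid_by_mspi 0          # on the fly
echo "cphwd_medium_path_qid_by_mspi=0" >> $FWDIR/boot/modules/fwkern.conf    # persistent across reboot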
Thanks to anyone who just read through all of the above.