Hello,
After a recent upgrade to R81.20 we are seeing recurring issues with some IPsec site-to-site VPNs. There are roughly 30 tunnels hosted on a 6200 appliance. Traffic volume is low, but the tunnels are business-critical, connecting AWS, Azure, and on-prem networks for several web applications.
The facts I have so far:
[1] The issue never manifested on R80.40.
[2] It has happened 5 times in the 2 weeks since we moved to R81.20.
[3] CPU slowly climbs to 100% and stays there, which likely impacts IKE negotiation.
[4] Graphs show a steady increase in interface errors before the issue occurs.
[5] Graphs show a constant increase in F2F packets until, at some point, the issue hits.
[6] The only workaround is a reboot and switchover to the other cluster member, then waiting for the ramp-up to bring that member to its breaking point as well.
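Since the F2F ramp-up is visible well before the breaking point, a crude early-warning check can be scripted around the `fwaccel stats -s` output. A minimal sketch, assuming the "F2Fed pkts/Total pkts" line format shown further down in this post (the 90% threshold is an arbitrary choice):

```shell
#!/bin/bash
# Extract the F2F percentage from `fwaccel stats -s`-style output.
# On the gateway you would feed the live output in instead:
#   fwaccel stats -s | f2f_pct
f2f_pct() {
    awk -F'[(%]' '/F2Fed pkts/ { print $2 }'
}

# Demo against the stats captured in this post:
pct=$(f2f_pct <<'EOF'
F2Fed pkts/Total pkts : 5392497757/5892962467 (91%)
EOF
)
echo "F2F at ${pct}%"
if [ "$pct" -ge 90 ]; then
    echo "WARNING: nearly everything is F2F - consider a controlled switchover"
fi
```

Cron something like this on both members and you get a window to fail over on your own terms instead of at 100% CPU.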
Opened a TAC ticket; the only meaningful response so far is that it may have happened to other customers and they are working on it, but usually on SND cores rather than FW workers (as in this case).
Questions:
[A] Do you know of any issues on R81.20 JHF53 where traffic is not accelerated properly?
[B] Should IPsec traffic terminating on the device be accelerated?
[C] What can cause the "Failed to get native SXL device" message seen in CPVIEW > Advanced?
Long Details:
Upon investigation, spike_detective reports that the top usage comes from two processes (fwk worker instances):
##head -30 perf_thread_452.lo
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3K of event 'cycles'
# Event count (approx.): 3651993252
#
# Overhead Command Shared Object Symbol
# ........ ....... ..................... .....................................................
#
73.07% fwk0_2 [kernel.kallsyms] [k] fwmultik_do_seq_on_packet
17.79% fwk0_2 [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
##head -30 perf_cpu_3.lo
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3K of event 'cycles'
# Event count (approx.): 3307288381
#
# Overhead Command Shared Object Symbol
# ........ ....... ..................... ................................................................................
#
63.14% fwk0_0 [kernel.kallsyms] [k] fwmultik_do_seq_on_packet
18.13% fwk0_0 [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
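To confirm that the same symbols dominate every capture, the spike_detective perf logs can be aggregated in one pass. A throwaway sketch, assuming all the `perf_*.lo` files have been copied into one directory (the stand-in data below mirrors the two captures above):

```shell
#!/bin/bash
# Sum perf overhead per kernel symbol across several spike_detective logs.
dir=$(mktemp -d)
# Stand-in data mirroring the two captures in this post:
cat > "$dir/perf_thread_452.lo" <<'EOF'
73.07% fwk0_2 [kernel.kallsyms] [k] fwmultik_do_seq_on_packet
17.79% fwk0_2 [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
EOF
cat > "$dir/perf_cpu_3.lo" <<'EOF'
63.14% fwk0_0 [kernel.kallsyms] [k] fwmultik_do_seq_on_packet
18.13% fwk0_0 [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
EOF

# Sum overhead per symbol across all captures, highest first:
awk '/%/ { gsub(/%/, "", $1); sum[$NF] += $1 }
     END { for (s in sum) printf "%7.2f%% %s\n", sum[s], s }' \
    "$dir"/perf_*.lo | sort -rn
rm -rf "$dir"
```

In our case the aggregate makes it obvious: `fwmultik_do_seq_on_packet` plus the spinlock slowpath account for practically all worker cycles, which fits a box forcing ~90% of packets through F2F sequencing.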
Looking at the stats on the firewall, there is almost no acceleration (IPS was enabled a day ago as a mitigation for SYN attacks, based on other posts by Tim Hall and Heiko):
##enabled_blade
fw vpn ips identityServer mon
##fwaccel stats -
Accelerated conns/Total conns : 634/9357 (6%)
LightSpeed conns/Total conns : 0/9357 (0%)
Accelerated pkts/Total pkts : 500464710/5892962467 (8%)
LightSpeed pkts/Total pkts : 0/5892962467 (0%)
F2Fed pkts/Total pkts : 5392497757/5892962467 (91%)
F2V pkts/Total pkts : 4696613/5892962467 (0%)
CPASXL pkts/Total pkts : 0/5892962467 (0%)
PSLXL pkts/Total pkts : 274040788/5892962467 (4%)
CPAS pipeline pkts/Total pkts : 0/5892962467 (0%)
PSL pipeline pkts/Total pkts : 0/5892962467 (0%)
QOS inbound pkts/Total pkts : 0/5892962467 (0%)
QOS outbound pkts/Total pkts : 0/5892962467 (0%)
Corrected pkts/Total pkts : 0/5892962467 (0%)
Reason for no acceleration: local traffic (IPsec terminates on the gateway) and "Native SXL device cannot be found".
There is a steady increase in interface errors before the issue occurs (possibly because the CPU is pegged at 100% handling packets and is discarding them erratically).
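One way to quantify that climb is to diff the kernel's per-interface RX error counters between two snapshots. A minimal sketch (the demo counters are made up; the live-snapshot commands in the comment assume the standard Linux `/sys/class/net` layout, which Gaia provides):

```shell
#!/bin/bash
# Print interfaces whose RX error counter moved between two snapshots.
# On the gateway you would take real snapshots a minute apart, e.g.:
#   s1=$(grep -H . /sys/class/net/*/statistics/rx_errors); sleep 60
#   s2=$(grep -H . /sys/class/net/*/statistics/rx_errors)
rx_err_delta() {
    join -t: <(echo "$1") <(echo "$2") | awk -F: '$3 > $2 {
        printf "%s: +%d rx_errors\n", $1, $3 - $2 }'
}

# Demo with canned counters (eth1 climbing, eth2 steady):
s1=$'eth1:1200\neth2:17'
s2=$'eth1:1864\neth2:17'
rx_err_delta "$s1" "$s2"
```

Correlating these deltas with the F2F percentage over time should show whether the errors are a cause or, as I suspect, a symptom of the saturated workers.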