Hello,
After a recent upgrade to R81.20 we are seeing recurring issues with some IPsec site-to-site VPNs. There are roughly 30 tunnels hosted on a 6200 appliance. Traffic volume is low, but the tunnels are business-critical, connecting AWS, Azure, and on-prem networks for several web applications.
The facts I have so far:
[1] The issue never manifested on R80.40.
[2] It has happened 5 times in the 2 weeks since we moved to R81.20.
[3] CPU slowly climbs to 100% and stays there, which likely impacts IKE negotiation.
[4] Graphs show a steady increase in interface errors before the issue occurs.
[5] Graphs show a constant increase in F2F packets until, at some point, the issue hits.
[6] The only workaround is a reboot and switchover to the other cluster member, then waiting for the ramp-up to bring that member to its breaking point as well.
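Since the F2F ramp-up is visible well before the breaking point, a crude early-warning check can be scripted around the `fwaccel stats -s` output. A minimal sketch, assuming the "F2Fed pkts/Total pkts" line format shown further down in this post (the 90% threshold is an arbitrary choice):

```shell
#!/bin/bash
# Extract the F2F percentage from `fwaccel stats -s`-style output.
# On the gateway you would feed the live output in instead:
#   fwaccel stats -s | f2f_pct
f2f_pct() {
    awk -F'[(%]' '/F2Fed pkts/ { print $2 }'
}

# Demo against the stats captured in this post:
pct=$(f2f_pct <<'EOF'
F2Fed pkts/Total pkts : 5392497757/5892962467 (91%)
EOF
)
echo "F2F at ${pct}%"
if [ "$pct" -ge 90 ]; then
    echo "WARNING: nearly everything is F2F - consider a controlled switchover"
fi
```

Cron something like this on both members and you get a window to fail over on your own terms instead of at 100% CPU.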
Opened a TAC ticket; the only meaningful response so far is that it may have happened to other customers and they are working on it, but usually on SND cores rather than FW workers (as in this case).
Questions:
[A] Do you know of any issues on R81.20 JHF53 where traffic is not accelerated properly?
[B] Should IPsec traffic terminating on the device be accelerated?
[C] What can cause the "Failed to get native SXL device" message seen in CPVIEW > Advanced?
Long Details:
Upon investigation, spike_detective reports that the top usage comes from two processes (fwk worker instances):
##head -30 perf_thread_452.lo
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3K of event 'cycles'
# Event count (approx.): 3651993252
#
# Overhead Command Shared Object Symbol
# ........ ....... ..................... .....................................................
#
73.07% fwk0_2 [kernel.kallsyms] [k] fwmultik_do_seq_on_packet
17.79% fwk0_2 [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
##head -30 perf_cpu_3.lo
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3K of event 'cycles'
# Event count (approx.): 3307288381
#
# Overhead Command Shared Object Symbol
# ........ ....... ..................... ................................................................................
#
63.14% fwk0_0 [kernel.kallsyms] [k] fwmultik_do_seq_on_packet
18.13% fwk0_0 [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
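To confirm that the same symbols dominate every capture, the spike_detective perf logs can be aggregated in one pass. A throwaway sketch, assuming all the `perf_*.lo` files have been copied into one directory (the stand-in data below mirrors the two captures above):

```shell
#!/bin/bash
# Sum perf overhead per kernel symbol across several spike_detective logs.
dir=$(mktemp -d)
# Stand-in data mirroring the two captures in this post:
cat > "$dir/perf_thread_452.lo" <<'EOF'
73.07% fwk0_2 [kernel.kallsyms] [k] fwmultik_do_seq_on_packet
17.79% fwk0_2 [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
EOF
cat > "$dir/perf_cpu_3.lo" <<'EOF'
63.14% fwk0_0 [kernel.kallsyms] [k] fwmultik_do_seq_on_packet
18.13% fwk0_0 [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
EOF

# Sum overhead per symbol across all captures, highest first:
awk '/%/ { gsub(/%/, "", $1); sum[$NF] += $1 }
     END { for (s in sum) printf "%7.2f%% %s\n", sum[s], s }' \
    "$dir"/perf_*.lo | sort -rn
rm -rf "$dir"
```

In our case the aggregate makes it obvious: `fwmultik_do_seq_on_packet` plus the spinlock slowpath account for practically all worker cycles, which fits a box forcing ~90% of packets through F2F sequencing.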
Looking at the stats on the firewall, there is almost no acceleration (IPS was enabled a day ago as a mitigation for SYN attacks, based on other posts by Tim Hall and Heiko):
##enabled_blade
fw vpn ips identityServer mon
##fwaccel stats -
Accelerated conns/Total conns : 634/9357 (6%)
LightSpeed conns/Total conns : 0/9357 (0%)
Accelerated pkts/Total pkts : 500464710/5892962467 (8%)
LightSpeed pkts/Total pkts : 0/5892962467 (0%)
F2Fed pkts/Total pkts : 5392497757/5892962467 (91%)
F2V pkts/Total pkts : 4696613/5892962467 (0%)
CPASXL pkts/Total pkts : 0/5892962467 (0%)
PSLXL pkts/Total pkts : 274040788/5892962467 (4%)
CPAS pipeline pkts/Total pkts : 0/5892962467 (0%)
PSL pipeline pkts/Total pkts : 0/5892962467 (0%)
QOS inbound pkts/Total pkts : 0/5892962467 (0%)
QOS outbound pkts/Total pkts : 0/5892962467 (0%)
Corrected pkts/Total pkts : 0/5892962467 (0%)
Reason for no acceleration: local traffic (IPsec terminates on the gateway) and "Native SXL device cannot be found".
There is a steady increase in interface errors before the issue occurs (possibly because the CPU is pegged at 100% handling packets and is discarding them erratically).
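One way to quantify that climb is to diff the kernel's per-interface RX error counters between two snapshots. A minimal sketch (the demo counters are made up; the live-snapshot commands in the comment assume the standard Linux `/sys/class/net` layout, which Gaia provides):

```shell
#!/bin/bash
# Print interfaces whose RX error counter moved between two snapshots.
# On the gateway you would take real snapshots a minute apart, e.g.:
#   s1=$(grep -H . /sys/class/net/*/statistics/rx_errors); sleep 60
#   s2=$(grep -H . /sys/class/net/*/statistics/rx_errors)
rx_err_delta() {
    join -t: <(echo "$1") <(echo "$2") | awk -F: '$3 > $2 {
        printf "%s: +%d rx_errors\n", $1, $3 - $2 }'
}

# Demo with canned counters (eth1 climbing, eth2 steady):
s1=$'eth1:1200\neth2:17'
s2=$'eth1:1864\neth2:17'
rx_err_delta "$s1" "$s2"
```

Correlating these deltas with the F2F percentage over time should show whether the errors are a cause or, as I suspect, a symptom of the saturated workers.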