Richard_Wieser
Contributor

Remote Access VPN Packet loss

We have a cluster of two 19000 appliances running R81.20 JHF 118, with a 10 Gbps connection to the ISP. After upgrading the switch the cluster is connected to, we are seeing packet loss on Remote Access VPNs; S2S VPNs seem unaffected. Depending on the time of day, we see up to 50% loss on ping tests. I realize it's probably caused by the equipment that was changed (Cisco C9500), but I want to rule out anything on the Check Point side, as it seems to only affect RA VPN traffic.

WAN interface:
Interface eth3-01
state on
mac-addr 00:xx:xx:xx:xx:xx
type ethernet
link-state link up
mtu 1500
auto-negotiation on
speed 25G
ipv6-autoconfig Not configured
monitor-mode off
duplex full
link-speed Not configured
comments WAN
ipv4-address xxx.xxx.xxx.xxx/24
ipv6-address ***************
ipv6-local-link-address ***************

Thanks

israelfds95
Collaborator

In this case, it’s a good idea to start with some basic troubleshooting, focusing on the ISP WAN link used by the Remote Access VPN.

Next, on the Check Point side, verify the following:

  • cpview
     - Check memory status and CPU usage.

From Expert mode:

netstat -ni
  - Check if you see an abnormal number of RX-DROP or TX-DROP counters on the relevant NIC used for the RA VPN.

ethtool -g <isp-wan-ra-vpn-nic>
   - Confirm that the current RX/TX values match the pre-set maximums; under UPPAK they can differ and cause RX-DROP, while KPPAK can fix it.
ethtool -g <nic-connected-to-switch>

Verify NIC ring buffer settings and compare them if needed.

Additionally:

  • Run ping to the ISP gateway and other relevant gateways involved in this connection to check latency and packet loss.

  • Run traceroute to validate the path and identify possible routing or ISP-related issues.

These checks should give you a solid starting point for the investigation.
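If you want to watch whether those drop counters climb during busy periods, a minimal sketch from Expert mode (eth3-01 is just the example interface from this thread; a plain loop avoids assuming watch is installed):

while true; do date; netstat -ni | egrep "Iface|eth3-01"; sleep 10; done

Comparing samples taken at quiet and busy times shows whether RX-DRP growth tracks the RA VPN load.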

Richard_Wieser
Contributor

CPU Looks fine:
| CPU type   | CPUs | Avg utilization |
| CoreXL_SND | 10   | 15%             |
| CoreXL_FW  | 53   | 17%             |
| FWD        | 1    | 29%             |

There are drops which are climbing.
Iface    MTU   Met  RX-OK         RX-ERR  RX-DRP     RX-OVR  TX-OK         TX-ERR  TX-DRP  TX-OVR  Flg
eth3-01  1500  0    184241747707  0       111870671  0       103782379871  0       0       0       ABMRU
eth3-02  1500  0    89235490191   0       21345804   0       148776069625  0       6194    0       ABMRU

# ethtool -g eth3-01
Ring parameters for eth3-01:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096

LAN interface:
# ethtool -g eth3-02
Ring parameters for eth3-02:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 1024
RX Mini: 0
RX Jumbo: 0
TX: 512

Pings to the edge router are solid with minimal loss. Pings to the cluster's external VIP (without VPN) are also solid.

Thanks

israelfds95
Collaborator

Is the packet loss happening when reaching LAN IPs behind eth3-02 over the RA VPN?
Is it happening only over the RA VPN, or do you see the same thing when testing from other machines on the LAN?

One important point on interface eth3-02 is to adjust the RX/TX ring buffers to the maximum supported value (4096). This usually helps reduce RX-DROP counters and improves RX/TX performance on the interface:

set interface eth3-02 rx-ringsize 4096
set interface eth3-02 tx-ringsize 4096

save config

Normally, no reboot or cpstop; cpstart is required for this change to take effect.
However, if possible, performing a reboot later can be a good practice.

Reference:
How to increase the size of a ring buffer on Gaia OS for Intel NIC and Broadcom NIC
https://support.checkpoint.com/results/sk/sk42181

Another relevant point regarding RX-DROP on these interfaces is to verify the physical layer: connectors, cabling, and both firewall and switch ports, making sure everything is properly connected and error-free.

It’s also useful to run ping from the source host to the destination and test each hop along the path, trying to identify where packet loss starts occurring. Validate the full path end-to-end.

Based on that, validate the responses at each hop until you can clearly identify where the bottleneck is. This kind of issue can be hard to pinpoint and usually requires a structured, step-by-step troubleshooting approach.
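As a rough sketch of that per-hop test from Expert mode (the hop addresses below are placeholders for your ISP gateway, intermediate hops, and test destination):

for hop in <isp-gw> <next-hop> <destination>; do echo "--- $hop"; ping -c 100 -i 0.2 $hop | egrep "loss|rtt"; done

Running it both during and outside business hours helps show where along the path the loss first appears.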

the_rock
MVP Diamond

Can you run ethtool -S on the external interface and post the results, please?

Best,
Andy
Richard_Wieser
Contributor

# ethtool -S eth3-01
NIC statistics:
ifs_ibytes_hi: 48896
ifs_ibytes_lo: 3658302049
ifs_obytes_hi: 12721
ifs_obytes_lo: 3482159730
ifs_ipackets: 3997561350
ifs_opackets: 781934380
ifs_imcasts: 0
ifs_omcasts: 0
ifs_noproto: 0
ifs_ibcasts: 0
ifs_obcasts: 0
ifs_linkchanges: 0
ife_ierrors: 0
ife_oerrors: 0
ife_iqdrops: 0
ife_oqdrops: 0
iee_rx_missed: 0
rx_q0_packets: 1006105490
rx_q1_packets: 607304513
rx_q2_packets: 631652734
rx_q3_packets: 504148059
rx_q4_packets: 827473958
rx_q5_packets: 502861718
rx_q6_packets: 664985290
rx_q7_packets: 315651588
rx_q8_packets: 714069108
rx_q9_packets: 884484676
rx_q10_packets: 2165548462
rx_q11_packets: 2150313934
rx_q12_packets: 497831970
rx_q13_packets: 505753463
rx_q14_packets: 118434909
rx_q15_packets: 116657162
rx_q16_packets: 110652105
rx_q17_packets: 104117640
rx_q18_packets: 61716064
rx_q19_packets: 59423713
rx_q20_packets: 18346048
rx_q21_packets: 19788763
rx_q22_packets: 10605
rx_q23_packets: 10342
rx_q24_packets: 36660
rx_q25_packets: 12214
rx_q26_packets: 11185
rx_q27_packets: 52927
rx_q28_packets: 7938
rx_q29_packets: 9890
rx_q30_packets: 12314
rx_q31_packets: 10414
tx_q0_packets: 2974195200
tx_q1_packets: 332007692
tx_q2_packets: 36864168
tx_q3_packets: 4266557949
tx_q4_packets: 4245789008
tx_q5_packets: 9955395
tx_q6_packets: 313
tx_q7_packets: 178
tx_q8_packets: 1335432791
tx_q9_packets: 253559744
tx_q10_packets: 50147090
tx_q11_packets: 38850697
tx_q12_packets: 22518310
tx_q13_packets: 24
tx_q14_packets: 5
tx_q15_packets: 21
tx_q16_packets: 2728173556
tx_q17_packets: 78883520
tx_q18_packets: 4170121829
tx_q19_packets: 4286020273
tx_q20_packets: 4280188530
tx_q21_packets: 8588916
tx_q22_packets: 95
tx_q23_packets: 466
tx_q24_packets: 1371308077
tx_q25_packets: 250363063
tx_q26_packets: 45751553
tx_q27_packets: 37940200
tx_q28_packets: 23243746
tx_q29_packets: 49
tx_q30_packets: 27
tx_q31_packets: 6
tx_q32_packets: 242970
rx_good_packets: 3997561350
tx_good_packets: 781934380
rx_good_bytes: 3658302049
tx_good_bytes: 3482159730
rx_missed_errors: 111889044
rx_errors: 0
tx_errors: 0
rx_mbuf_allocation_errors: 0
rx_unicast_packets: 3901197675
rx_multicast_packets: 5587460
rx_broadcast_packets: 204502017
rx_dropped_packets: 0
rx_unknown_protocol_packets: 1836758
tx_unicast_packets: 778620776
tx_multicast_packets: 104039
tx_broadcast_packets: 3209565
tx_dropped_packets: 0
tx_link_down_dropped: 9
rx_crc_errors: 0
rx_illegal_byte_errors: 0
rx_error_bytes: 0
mac_local_errors: 1
mac_remote_errors: 12
rx_len_errors: 0
tx_xon_packets: 0
rx_xon_packets: 0
tx_xoff_packets: 0
rx_xoff_packets: 0
rx_size_64_packets: 4280691386
rx_size_65_to_127_packets: 3232002451
rx_size_128_to_255_packets: 2448230997
rx_size_256_to_511_packets: 368156231
rx_size_512_to_1023_packets: 1410107132
rx_size_1024_to_1522_packets: 962033547
rx_size_1523_to_max_packets: 0
rx_undersized_errors: 0
rx_oversize_errors: 0
rx_mac_short_pkt_dropped: 0
rx_fragmented_errors: 0
rx_jabber_errors: 0
tx_size_64_packets: 362352262
tx_size_65_to_127_packets: 2316693580
tx_size_128_to_255_packets: 1417296860
tx_size_256_to_511_packets: 3492555606
tx_size_512_to_1023_packets: 422889831
tx_size_1024_to_1522_packets: 1360080833
tx_size_1523_to_max_packets: 0

the_rock
MVP Diamond

I don't see any RX or TX errors at all; looks fine to me.

Best,
Andy
israelfds95
Collaborator

The first netstat -ni output shows RX-DRP climbing on both interfaces:

Iface    MTU   Met  RX-OK         RX-ERR  RX-DRP     RX-OVR  TX-OK         TX-ERR  TX-DRP  TX-OVR  Flg
eth3-01  1500  0    184241747707  0       111870671  0       103782379871  0       0       0       ABMRU
eth3-02  1500  0    89235490191   0       21345804   0       148776069625  0       6194    0       ABMRU

So, for me, increasing the RX/TX ring buffers on eth3-02 is still a valid point: it's simple to fix (as above: set interface eth3-02 rx-ringsize 4096 and tx-ringsize 4096, then save config; see sk42181) and should bring a quick improvement. After that, give it some time to see whether the RX-DRP counter decreases, and continue the investigation.

Richard_Wieser
Contributor

I've maxed out the ring buffers and rebooted the member in question. Now I have:

# ethtool -g eth3-01
Ring parameters for eth3-01:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096

# ethtool -g eth3-02
Ring parameters for eth3-02:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096

Kernel Interface table
Iface    MTU   Met  RX-OK      RX-ERR  RX-DRP  RX-OVR  TX-OK      TX-ERR  TX-DRP  TX-OVR  Flg
eth3-01  1500  0    183220823  0       10025   0       109400885  0       0       0       ABMRU
eth3-02  1500  0    89822553   0       5       0       143595050  0       0       0       ABMRU

israelfds95
Collaborator

Very good, this is a solid improvement on eth3-02. Now monitor netstat -ni to see whether the RX-DRP counter on eth3-01 keeps growing.

Next, run latency and packet loss tests over the Remote Access VPN and share the results.




Richard_Wieser
Contributor

RX-DRP on eth3-01 is still climbing, currently at 17063. Eth3-02 still at 5. Ping test is better.
Packets: Sent = 5232, Received = 5009, Lost = 223 (4% loss),
Approximate round trip times in milli-seconds:
Minimum = 8ms, Maximum = 71ms, Average = 19ms

This seems to vary with load. It's a lot better outside business hours. Being Friday afternoon, there are fewer people connected, roughly 166, whereas the daily average is 250-270.

israelfds95
Collaborator

Yes, eth3-02 was resolved by adjusting the RX/TX buffer sizes. eth3-01 still needs further investigation and should be monitored over the next few days, especially when normal peak business traffic returns, to see whether the RX-DRP counter keeps growing beyond an acceptable rate.

There is still some packet loss, and this needs to be evaluated carefully. It’s important to validate all hops end-to-end and review all elements related to this interface, including the physical layer and the ISP next-hop. 

It’s also worth running hcp -r all to collect additional information from the health report.

Please keep us updated with any new findings.

Best regards

Timothy_Hall
MVP Gold

1) The presence of non-zero TX errors would suggest UPPAK is enabled, although it depends on whether your current code level was reached via upgrade or a fresh install.  Please provide output of fwaccel stat.

2) While the RX-DRP number does look concerning, it is a red herring.  The drop rate is 0.06% which is below the 0.1% guideline; based on your ethtool outputs these are "real" buffering misses and not discarded junk traffic such as unknown EtherTypes or improperly pruned VLAN tags.  Increasing the ring buffer size is not likely to make a difference and will probably make things worse in the long run by introducing jitter due to BufferBloat.  I'm surprised you are getting any real RX-DRPs at all if UPPAK is enabled unless the firewall is severely overloaded. Is Dynamic Split enabled to add more SND cores as needed to speed the emptying of ring buffers?  dynamic_balancing -p
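For reference, that 0.06% figure falls straight out of the netstat -ni counters posted above:

RX-DRP / RX-OK = 111870671 / 184241747707 ≈ 0.0006, i.e. roughly 0.06%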

As for the 50% loss specifically afflicting Remote Access VPN:

a) Please describe the client(s) you are using:  Mobile Access/SSL Extender?  SecuRemote?  Check Point Mobile?  EndPoint Security?  

b) Force use of Visitor Mode to TCP/443 with your Remote Access IPSec client (or turn it off if already forced). Does the performance issue go away? If so, that would suggest an MTU/MSS clamping issue or another issue with IPSec (which causes about 50% packet loss for full-size packets), although your site-to-site VPNs don't seem to be affected (unless they are already separately MSS clamped by you or your peer gateway(s)). A quick DF-bit ping check is sketched after this list.

c) Check what algorithms are being used by your IPSec clients in Global Properties under Remote Access...VPN Authentication & Encryption...Encryption algorithms...Edit...IPSec Security Association (Phase 2).  Is it still 3DES/MD5?  May not be your complete problem but certainly not helping, and also may not be interacting well with UPPAK if it is enabled.

d) Replacement of switch may have changed the state of Ethernet Flow Control (Pause Frames) and whether they are still enabled on both sides.  Please provide output of ethtool -a and ethtool -i for affected interface.

e) As a last resort you can try disabling SecureXL acceleration of vpn traffic with vpn accel off then retest Remote Access VPN performance.  Be warned this will potentially disrupt all existing IPSec tunnels including site-to-site, and will move all IPSec processing out of SecureXL (especially if UPPAK is enabled) and back onto the Firewall Worker Instances.  Schedule a maintenance window before attempting this.
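As a quick, hedged side check of the MTU/MSS theory in point b) above, from a Windows RA client connected to the tunnel (the payload sizes are illustrative, not exact thresholds; <internal-host> is a placeholder):

ping -f -l 1300 <internal-host>
ping -f -l 1450 <internal-host>

If the larger don't-fragment ping consistently fails over the tunnel while the smaller one succeeds, fragmentation/MSS clamping is a likely contributor to the loss.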

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course
Richard_Wieser
Contributor

  1. # fwaccel stat
    +---------------------------------------------------------------------------------+
    |Id|Name |Status |Interfaces |Features |
    +---------------------------------------------------------------------------------+
    |0 |UPPAK |enabled |Sync,eth1-05,eth1-06, |Acceleration,Cryptography |
    | | | |eth1-08,eth3-03,eth3-01, | |
    | | | |eth3-04,eth3-02 |Crypto: Tunnel,UDPEncap,MD5, |
    | | | | |SHA1,3DES,DES,AES-128,AES-256,|
    | | | | |ESP,LinkSelection,DynamicVPN, |
    | | | | |NatTraversal,AES-XCBC,SHA256, |
    | | | | |SHA384,SHA512 |
    +---------------------------------------------------------------------------------+

    Accept Templates : enabled
    Drop Templates : enabled
    NAT Templates : enabled
    LightSpeed Accel : disabled

  2. # dynamic_balancing -p
    Dynamic Balancing is currently On
    1. Endpoint Security VPN (Mostly E88.70 )
    2.  
    3. Phase 1 supports DES :(, AES-128, AES-256, MD5, SHA1, SHA256, DH groups 2, 5, 14 (using group 2). Phase 2: 3DES, AES-128, AES-256; data integrity: SHA1; DH group 2.
    4. # ethtool -a eth3-01
      Pause parameters for eth3-01:
      Autonegotiate: off
      RX: off
      TX: off
      # ethtool -i eth3-01
      driver: net_ice
      version: DPDK 20.11.7.7.0 (11 Jun 25)
      firmware-version: 4.30 0x8001b94f 1.3415.0
      expansion-rom-version:
      bus-info: 0000:b1:00.1
      supports-statistics: yes
      supports-test: no
      supports-eeprom-access: no
      supports-register-dump: no
      supports-priv-flags: yes

The SMS has been upgraded to get to this version, from the R60 days!😲

Timothy_Hall
MVP Gold

1) Please provide a screenshot of the IPSec Phase 2 screen for Remote Access VPN.  By default, it supports almost all algorithms but really only allows 3DES/MD5 to be used.  The checkbox at the bottom determines that.

2) The ice driver in use on eth3-01 has a known issue with improper balancing of IPSec traffic across SND cores (sk183525 - High CPU usage on one SND core), but supposedly UPPAK mode fixes that.  Using cpview (not top or any other Linux-based measuring tools), are any SND cores overloaded?  

3) Pause frames are off on the gateway, so the corresponding setting on the switch does not matter.  Any carrier transitions on eth3-01?  Run ifconfig eth3-01 and check the "carrier" value.  Check counters on the switchport side as well for errors or problems (a hedged switch-side sketch follows this list).

4) Try toggling Visitor Mode to see if it makes a difference, otherwise I'd suspect an issue with UPPAK mode.  You can try setting it back to KPPAK and see if that resolves the issue.
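For the switchport check in point 3, a hedged sketch of what to look at on the Cisco C9500 side (IOS-XE syntax from memory, interface name is a placeholder; verify on your platform):

show interfaces TwentyFiveGigE1/0/1
show interfaces TwentyFiveGigE1/0/1 counters errors

Input/CRC errors, pause counters, or interface resets on the port facing eth3-01 would point at the physical layer or the new switch rather than the gateway.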

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course
Richard_Wieser
Contributor

# ifconfig eth3-01
eth3-01 Link encap:Ethernet HWaddr x:x:x:x
inet addr:x.x.x.x Bcast:x.x.x.x.x Mask:255.255.255.0
inet6 addr: x:x:x:x:x Scope:Global
inet6 addr: Scope:Link
UP BROADCAST RUNNING ALLMULTI MULTICAST MTU:1500 Metric:1
RX packets:791585415 errors:0 dropped:53078 overruns:0 frame:0
TX packets:482735314 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:2048
RX bytes:861738566609 (802.5 GiB) TX bytes:236663858482 (220.4 GiB)

Timothy_Hall
MVP Gold

RAS VPN settings look good, SNDs do not appear imbalanced.  Probably will need to involve TAC at this point; could still try vpn accel off or switching back to KPPAK mode to isolate whether the issue lies with SecureXL or not.

Could also try fw ctl zdebug + drop to determine whether the 50% packet loss is being directly caused in the Check Point code itself (unlikely), or by some external factor (Gaia, switch, etc. - more likely).
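A hedged way to scope that during a test window (the client IP is a placeholder; zdebug output is very verbose, so filter it and stop it promptly):

fw ctl zdebug + drop | grep <ra-client-public-ip>

If the loss is happening outside the firewall, you would expect to see few or no drop messages for that client's packets.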

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course
the_rock
MVP Diamond

Technically, as a quick test, on top of what Tim said, you can also try fwaccel off and see what happens. If there's no change, run fwaccel on to turn SecureXL back on.

Best,
Andy
the_rock
MVP Diamond

Definitely good idea @israelfds95 

Best,
Andy
WiliRGasparetto
Contributor

Asymmetric forwarding / per-packet load-balancing introduced by the new switch

Remote Access VPN (typically IKE/UDP 500 + NAT-T UDP 4500, plus encrypted payload) is far more sensitive to out-of-order and state asymmetry than many S2S deployments (which might be pinned differently, use different selectors, different peers, or simply have fewer concurrent flows).

Common patterns after a switch upgrade:

  • EtherChannel/LACP hashing changed (or default hash algorithm differs) so UDP/4500 flows are not consistently pinned.
  • ECMP/per-packet load balancing enabled (even unintentionally) upstream/downstream.
  • Return traffic is not guaranteed to hit the same cluster member that established the IKE/IPsec SA.

Why this matches your symptom:
Time-of-day variation = congestion/microburst or load distribution changes as more RA users connect. If the “wrong” member sees return traffic or packets arrive out-of-order, you get apparent loss inside the tunnel even if raw interface counters look okay.
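If you want to verify the hashing point on the C9500 itself, a hedged sketch (IOS-XE commands from memory; addresses are placeholders):

show etherchannel load-balance
show ip cef exact-route <ra-client-public-ip> <cluster-external-vip>

The first shows which fields the port-channel hash uses; the second shows which path a given client-to-gateway flow would take, which helps spot per-flow vs. per-packet balancing surprises.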


WiliRGasparetto
Contributor

How to rule out “Check Point side” fast (with evidence)

  A) Confirm whether the loss is already on the wire or only inside the tunnel

On the active member that owns the external IP at the time of testing:

  1. Capture encrypted traffic on the WAN (a narrowed per-client variant is sketched just below):

tcpdump -ni eth3-01 udp port 4500 or udp port 500

  2. Run a sustained ping across the RA tunnel and watch:
    • Do you see the client's UDP/4500 packets arriving consistently?
    • Do you see your replies leaving consistently?

  3. If packets arrive clean but loss is still seen inside the tunnel, move on to kernel drops / the decryption path.
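To narrow that capture to a single affected user, a hedged variant (the client public IP is a placeholder):

tcpdump -ni eth3-01 'host <ra-client-public-ip> and (udp port 4500 or udp port 500)'

Counting request/reply pairs in the capture while the user runs a ping makes it easier to tell wire loss from in-tunnel loss.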

  B) Check interface health and drops (this is the "no excuses" baseline)

On each cluster member (both), collect:

ethtool -S eth3-01 | egrep -i "drop|discard|err|crc|over|miss|fifo|timeout"

ip -s link show dev eth3-01

What you want:

  • CRC / input errors should be ~0.
  • If you see rx_dropped / rx_missed_errors climbing, that’s a local receive path issue (driver, queueing, bursts).

Also grab:

cpview

Look specifically at:

  • CPU saturation / spikes at loss times
  • Interrupt load
  • Interface drops

 

  C) Check for kernel drops and acceleration constraints

On the active member during the test window:

fw ctl pstat

fwaccel stats -s

Indicators of a Check Point-side bottleneck:

  • High “dropped” counters in kernel/queues
  • Acceleration offloading flapping (less common, but you’ll see it)

If you suspect drops in the firewall path:

fw ctl zdebug drop | head

Richard_Wieser
Contributor

I'm currently seeing ~2% packet loss (ping test), which is a pretty good improvement. @Timothy_Hall is there documentation on forcing the RA client to use Visitor Mode? I see in the trac.defaults file there is
"transport STRING Auto-Detect GW_USER 0"

Timothy_Hall
MVP Gold

So the only change you made that reduced packet loss from 50% to 2% was the ring buffer change?  Interesting, wouldn't think that would have such a big impact, but the rules may have changed with UPPAK.

Keep in mind that Visitor Mode reduces performance and can introduce bottlenecks in the user-space process vpnd on the firewall. The goal of forcing this mode would be to determine whether your performance issue is being caused by an inability to fragment IPSec/ESP packets, or by UDP/4500 (NAT-T) being rate-limited or not reliably delivered. It is not really a permanent solution and will reduce overall VPN performance in the long run. Visitor Mode forces everything over TCP/443 and is dictated by the site download config, adjustable via GUIdbedit: sk107433: How to change transport method with Endpoint Clients. I'd assume you'd need to update or recreate the site for this to take effect. You might be able to force this on an individual client only, but I don't know how; probably something with that transport directive in the trac file, and the options in GUIdbedit should guide you.

 

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course