Greetings,
I'm facing issues with high latency on a CPAP-SG3600 HA-cluster. I have a TAC-case on this topic, but I wanted to check if anyone has some pointers.
This HA-cluster ran fine until mid-February. Then we started seeing periods of extreme latency through the active firewall (3000 ms+). Forcing a failover always resolved it, but the issue would recur a few hours later. Rinse and repeat.
The HA-cluster had been running Gaia R81.20 GA since November without issues until mid-February. In late February, we applied R81.20 JHF Take 8, but it made no difference. Last week I did a clean install from USB on both gateways and also ran HDT (Hardware Diagnostic Tools), but the behaviour is still the same.
After digging through /var/log/messages I noticed messages pointing towards issues with PrioQ. After disabling PrioQ, the latency issues went away at first, but they keep returning: less frequently than before, but still often. I also stumbled upon sk180437 - Unexpected traffic latency or outage on a Security Gateway / Cluster after policy installation. We had messages in /var/log/messages similar to those referenced in the SK, so I applied the solution, but it didn't seem to change anything.
https://support.checkpoint.com/results/sk/sk180437
Upon reviewing my latest HealthCheck Point (HCP) report, I noticed it complaining about:
F2F rate is high. Can be reduced by optimizing rule-base, changing blades or additional configurations - check 'sk98348' section (3-5). packets in the last 5 seconds: 214177, slow path packets: 212613, percentage: 99.269762859690815%
This struck me as quite odd. We don't have anything in our firewall policy that should hamper SecureXL in such a way:
[Expert@:0]# fwaccel stat
+---------------------------------------------------------------------------------+
|Id|Name     |Status     |Interfaces               |Features                      |
+---------------------------------------------------------------------------------+
|0 |KPPAK    |enabled    |eth5,Mgmt,eth1,eth2,eth3,|Acceleration,Cryptography     |
|  |         |           |eth4                     |                              |
|  |         |           |                         |Crypto: Tunnel,UDPEncap,MD5,  |
|  |         |           |                         |SHA1,3DES,DES,AES-128,AES-256,|
|  |         |           |                         |ESP,LinkSelection,DynamicVPN, |
|  |         |           |                         |NatTraversal,AES-XCBC,SHA256, |
|  |         |           |                         |SHA384,SHA512                 |
+---------------------------------------------------------------------------------+
Accept Templates : enabled
Drop Templates : enabled
NAT Templates : enabled
LightSpeed Accel : disabled
But we are clearly having issues with accelerated traffic as pretty much all traffic is hitting F2F:
[Expert@:0]# fwaccel stats -s
Accelerated conns/Total conns : 22/687 (3%)
LightSpeed conns/Total conns : 0/687 (0%)
Accelerated pkts/Total pkts : 96180156/7276233558 (1%)
LightSpeed pkts/Total pkts : 0/7276233558 (0%)
F2Fed pkts/Total pkts : 7180053402/7276233558 (98%)
F2V pkts/Total pkts : 3707650/7276233558 (0%)
CPASXL pkts/Total pkts : 188774/7276233558 (0%)
PSLXL pkts/Total pkts : 87362563/7276233558 (1%)
CPAS pipeline pkts/Total pkts : 0/7276233558 (0%)
PSL pipeline pkts/Total pkts : 0/7276233558 (0%)
QOS inbound pkts/Total pkts : 0/7276233558 (0%)
QOS outbound pkts/Total pkts : 0/7276233558 (0%)
Corrected pkts/Total pkts : 0/7276233558 (0%)
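For anyone who wants to compare these ratios across several samples (for example before and after a failover), here is a minimal Python sketch that recomputes the percentages from pasted `fwaccel stats -s` lines with more precision than the rounded values in parentheses. It assumes the `name : count/total (n%)` layout shown above. `SUMMARY_SAMPLE` and `parse_summary` are hypothetical names for illustration and are not part of any Check Point tooling.

```python
import re

# A few lines copied from the `fwaccel stats -s` output in this post.
SUMMARY_SAMPLE = """\
Accelerated conns/Total conns : 22/687 (3%)
Accelerated pkts/Total pkts : 96180156/7276233558 (1%)
F2Fed pkts/Total pkts : 7180053402/7276233558 (98%)
PSLXL pkts/Total pkts : 87362563/7276233558 (1%)
"""

# Assumed layout: "<name> : <count>/<total> (<n>%)"
LINE_RE = re.compile(r"^(?P<name>.+?)\s*:\s*(?P<num>\d+)/(?P<den>\d+)\s*\(")


def parse_summary(text):
    """Return {counter name: (count, total, percent)} for each matching line."""
    result = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip prompts, blank lines, anything in another format
        num, den = int(m["num"]), int(m["den"])
        pct = 100.0 * num / den if den else 0.0
        result[m["name"]] = (num, den, pct)
    return result


if __name__ == "__main__":
    for name, (num, den, pct) in parse_summary(SUMMARY_SAMPLE).items():
        print(f"{name:<32} {pct:7.3f}%")
```

Saving the command output to a file at intervals and feeding each file through `parse_summary` would show whether the F2F share climbs in the run-up to a latency episode or stays constantly high.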
[Expert@:0]# fwaccel stats -p
F2F packets:
--------------
Violation Packets Violation Packets
-------------------- --------------- -------------------- ---------------
Pkt has IP options 12 ICMP miss conn 1625812
TCP-SYN miss conn 4770065 TCP-other miss conn 24372326
UDP miss conn 3562308642 Other miss conn 242
VPN returned F2F 1106 Uni-directional viol 0
Possible spoof viol 0 TCP state viol 109
SCTP state affecting 0 Out if not def/accl 0
Bridge src=dst 0 Routing decision err 0
Sanity checks failed 0 Fwd to non-pivot 0
Broadcast/multicast 0 Cluster message 25468811
Cluster forward 9483581 Chain forwarding 0
F2V conn match pkts 4707 General reason 0
Route changes 0 VPN multicast traffic 0
GTP non-accelerated 0 Unresolved nexthop 29
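To see which violation reason dominates, the counters can be ranked by size. The following is a rough Python sketch, assuming the two-counters-per-line layout of the `fwaccel stats -p` output as pasted above. `VIOLATION_SAMPLE`, `parse_violations` and `rank` are hypothetical names for illustration only. On the numbers in this post, "UDP miss conn" is by far the largest counter.

```python
import re

# Lines copied from the `fwaccel stats -p` output in this post.
VIOLATION_SAMPLE = """\
Pkt has IP options 12 ICMP miss conn 1625812
TCP-SYN miss conn 4770065 TCP-other miss conn 24372326
UDP miss conn 3562308642 Other miss conn 242
VPN returned F2F 1106 Uni-directional viol 0
Possible spoof viol 0 TCP state viol 109
Broadcast/multicast 0 Cluster message 25468811
Cluster forward 9483581 Chain forwarding 0
F2V conn match pkts 4707 General reason 0
Route changes 0 VPN multicast traffic 0
GTP non-accelerated 0 Unresolved nexthop 29
"""

# A counter is "<name> <integer>", where the integer is a standalone token.
PAIR_RE = re.compile(r"(\S.*?)\s+(\d+)(?=\s|$)")


def parse_violations(text):
    """Return {violation name: packet count} from the pasted output."""
    counters = {}
    for line in text.splitlines():
        for name, value in PAIR_RE.findall(line.strip()):
            counters[name] = int(value)
    return counters


def rank(counters):
    """Return (name, count, share of listed total), largest first."""
    total = sum(counters.values()) or 1
    ordered = sorted(counters.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, count, count / total) for name, count in ordered]


if __name__ == "__main__":
    for name, count, share in rank(parse_violations(VIOLATION_SAMPLE))[:5]:
        print(f"{name:<24} {count:>12} {share:7.2%}")
```

The share here is relative to the sum of the listed violation counters, not to the total F2F packet count reported elsewhere, since those figures were sampled at slightly different moments.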
[Expert@:0]# fwaccel stats
Name Value Name Value
---------------------------- ------------------- ---------------------------- -------------------
LightSpeed Accelerated Path
--------------------------------------------------------------------------------------------------------
hw accel inbound bytes 0 hw accel packets 0
hw accel outbound bytes 0 hw accel conns 0
hw accel total conns 0 hw accel tcp conns 0
hw accel non-tcp conns 0
Accelerated Path
--------------------------------------------------------------------------------------------------------
accel packets 96203817 accel bytes 65663576815
outbound packets 96203313 outbound bytes 65787905840
conns created 3423544 conns deleted 3422909
C total conns 635 C TCP conns 486
C non TCP conns 149 nat conns 2671316
dropped packets 7704 dropped bytes 775666
fragments received 592 fragments transmit 0
fragments dropped 0 fragments expired 592
IP options dropped 0 corrs created 0
corrs deleted 0 C corrections 0
corrected packets 0 corrected bytes 0
Accelerated VPN Path
--------------------------------------------------------------------------------------------------------
C crypt conns 2 enc bytes 780268880
dec bytes 50958400 ESP enc pkts 1050432
ESP enc err 136 ESP dec pkts 554343
ESP dec err 0 ESP other err 1
espudp enc pkts 0 espudp enc err 0
espudp dec pkts 0 espudp dec err 0
espudp other err 0
Medium Streaming Path
--------------------------------------------------------------------------------------------------------
CPASXL packets 188774 PSLXL packets 87384461
CPASXL async packets 188774 PSLXL async packets 78909801
CPASXL bytes 179631938 PSLXL bytes 61559666392
C CPASXL conns 0 C PSLXL conns 613
CPASXL conns created 450 PSLXL conns created 3416719
PXL FF conns 0 PXL FF packets 8473823
PXL FF bytes 6982538553 PXL FF acks 3525076
PXL no conn drops 0
Pipeline Streaming Path
--------------------------------------------------------------------------------------------------------
PSL Pipeline packets 0 PSL Pipeline bytes 0
CPAS Pipeline packets 0 CPAS Pipeline bytes 0
QoS Paths
--------------------------------------------------------------------------------------------------------
QoS General Information:
------------------------
Total QoS Conns 0 QoS Classify Conns 0
QoS Classify flow 0 Reclassify QoS policy 0
FireWall QoS Path:
------------------
Enqueued IN packets 0 Enqueued OUT packets 0
Dequeued IN packets 0 Dequeued OUT packets 0
Enqueued IN bytes 0 Enqueued OUT bytes 0
Dequeued IN bytes 0 Dequeued OUT bytes 0
Accelerated QoS Path:
---------------------
Enqueued IN packets 0 Enqueued OUT packets 0
Dequeued IN packets 0 Dequeued OUT packets 0
Enqueued IN bytes 0 Enqueued OUT bytes 0
Dequeued IN bytes 0 Dequeued OUT bytes 0
Firewall Path
--------------------------------------------------------------------------------------------------------
F2F packets 7182789713 F2F bytes 1378432955760
TCP violations 109 F2V conn match pkts 4707
F2V packets 3709013 F2V bytes 239565290
GTP
--------------------------------------------------------------------------------------------------------
gtp tunnels created 0 gtp tunnels 0
gtp accel pkts 0 gtp f2f pkts 0
gtp spoofed pkts 0 gtp in gtp pkts 0
gtp signaling pkts 0 gtp tcpopt pkts 0
gtp apn err pkts 0
General
--------------------------------------------------------------------------------------------------------
memory used 40405632 C tcp handshake conns 243
C tcp established conns 218 C tcp closed conns 25
C tcp pxl handshake conns 243 C tcp pxl established conns 203
C tcp pxl closed conns 25 DNS DoR stats 21
(*) Statistics marked with C refer to current value, others refer to total value
As a temporary workaround, we have disabled all threat-prevention blades. These gateways aren't very powerful, so it makes sense that they start showing performance issues when barely any traffic is being accelerated. I also suspect the PrioQ problems are most likely a consequence of something else rather than the trigger for the latency issues.
The question is why so much traffic is hitting F2F. I have examined the firewall policy, which consists of 116 rules. The first rule containing applications is rule 107, an in-line layer for outbound traffic for a specific subnet. All rules having applications are within in-line layers towards the bottom of the policy package. I have a really hard time understanding why so little of the traffic is being accelerated.
Does anyone else have any experience with this? Any pointers to what I should look for to figure out and solve this behaviour?