Solved: Spark 1595 R81.10.08 Cluster VIP requirement on switch ports

Andy1977 · 2024-01-26

We have two Spark 1595 appliances running R81.10.08 under local management, forming a simple HA cluster with the following topology:

WAN: VIP 202.175.116.210/30, Primary member: 192.168.100.1, Secondary member: 192.168.100.2

(Since the ISP provides only one public IP, the Cluster VIP is in a different subnet from the member IPs, which is supported as of R81.10.X.)

LAN1: VIP 192.168.1.1, Primary member: 192.168.1.253, Secondary member: 192.168.1.254

LAN2 (Sync): 10.231.149.1 and 10.231.149.2
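The cross-subnet WAN VIP can be illustrated with Python's standard ipaddress module. This is only a sketch: the member netmasks are not stated in the post, so /24 is assumed for them. It shows that the WAN VIP lies outside the members' subnet (unlike the LAN1 VIP) and that a /30 leaves just two usable addresses, presumably one for the ISP router and one for the VIP.

```python
import ipaddress

# Addresses from the topology above. The member netmasks are NOT given in the
# post; /24 is an assumption made only for this illustration.
WAN_VIP = ipaddress.ip_interface("202.175.116.210/30")
WAN_MEMBERS = [ipaddress.ip_interface("192.168.100.1/24"),
               ipaddress.ip_interface("192.168.100.2/24")]
LAN1_VIP = ipaddress.ip_interface("192.168.1.1/24")
LAN1_MEMBERS = [ipaddress.ip_interface("192.168.1.253/24"),
                ipaddress.ip_interface("192.168.1.254/24")]


def vip_in_member_subnet(vip, members):
    """True if the VIP address lies inside every member's subnet."""
    return all(vip.ip in m.network for m in members)


def usable_hosts(iface):
    """Usable host addresses of the interface's subnet (no network/broadcast)."""
    return [str(h) for h in iface.network.hosts()]


if __name__ == "__main__":
    print("WAN VIP in member subnet:", vip_in_member_subnet(WAN_VIP, WAN_MEMBERS))
    print("LAN1 VIP in member subnet:", vip_in_member_subnet(LAN1_VIP, LAN1_MEMBERS))
    print("Usable hosts in the WAN /30:", usable_hosts(WAN_VIP))
```

Under these assumptions it reports False for WAN, True for LAN1, and 202.175.116.209 and 202.175.116.210 as the only usable addresses in the /30.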

Switch ports 27, 28 and 29 form an isolated VLAN in access mode and connect to the WAN ports of Firewall A and Firewall B and to the ISP router.

When sending 1000 PINGs from the firewall to 8.8.8.8, we see 3-4% packet loss with the firewalls in HA mode, but no loss with a standalone firewall on the same connection.

Could anyone suggest the best practices or requirements for switch ports carrying Cluster VIP connections? Thanks.


PhoneBoy · 2024-01-27

Best to consult with the TAC here: https://help.checkpoint.com

Chris_Atkinson · 2024-01-27

Which build of R81.10.08 firmware?

Are the switch ports configured for portfast/edge?

Do the firewalls report any failover events corresponding with the packet loss?


Andy1977 · 2024-01-28

Hi Chris,

The build version is R81.10.08 Build 683.

The firewalls do not report any failover events; the packet loss is around 4-6 out of 1000 PINGs.
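For scale, loss rate is simply lost / sent: 4-6 lost out of 1000 is 0.4-0.6%, whereas 3-4% would correspond to 30-40 lost PINGs per 1000. A trivial Python helper, for illustration only:

```python
def loss_percent(lost, sent):
    """Packet loss as a percentage of the packets sent."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * lost / sent


if __name__ == "__main__":
    # Compare the two figures mentioned in this thread for a 1000-PING run.
    for lost in (4, 6, 30, 40):
        print(f"{lost} lost out of 1000 -> {loss_percent(lost, 1000):.1f}%")
```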

The switch is under the client's management and they didn't provide much information. I will double-check with them whether the ports are set to portfast/edge. So, should the switch ports connecting the firewall WAN ports and the ISP router be configured as edge ports?

What about the switch ports connecting the LAN ports to the internal networks? Should those be in portfast mode as well?

K_R_V · 2024-01-29

Are you using VMAC? There is an issue with G-ARP in this version that causes this behavior.

Run cphaprob -a if and check whether there is a VMAC address after the VIP.
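With many interfaces, the check can be scripted. Below is a hedged Python sketch that parses the saved text of cphaprob -a if and reports the VMAC (if any) per VIP. The line layout it expects, such as "WAN 202.175.116.210 VMAC address: 00:1C:7F:00:71:08", is an assumption based on the sample output posted in this thread and may differ between builds.

```python
import re

# Assumed layout: "<interface> <VIP> [VMAC address: <mac>]" per virtual interface.
VIP_LINE = re.compile(
    r"^(?P<iface>\S+)\s+(?P<vip>\d{1,3}(?:\.\d{1,3}){3})"
    r"(?:\s+VMAC address:\s*(?P<vmac>[0-9A-Fa-f]{2}(?::[0-9A-Fa-f]{2}){5}))?\s*$"
)


def vmac_by_interface(output):
    """Return {interface: (vip, vmac or None)} for each virtual-interface line."""
    result = {}
    for line in output.splitlines():
        m = VIP_LINE.match(line.strip())
        if m:
            result[m.group("iface")] = (m.group("vip"), m.group("vmac"))
    return result


if __name__ == "__main__":
    sample = """Virtual cluster interfaces: 3

WAN 202.175.116.210 VMAC address: 00:1C:7F:00:71:08
LAN1 192.168.1.1 VMAC address: 00:1C:7F:00:71:08
"""
    for iface, (vip, vmac) in vmac_by_interface(sample).items():
        print(iface, vip, "VMAC" if vmac else "no VMAC", vmac or "")
```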

Andy1977 · 2024-01-29

Yes, there is a VMAC address after the VIP. Below is part of the output:

Virtual cluster interfaces: 3

WAN 202.175.116.210 VMAC address: 00:1C:7F:00:71:08
LAN1 192.168.1.1 VMAC address: 00:1C:7F:00:71:08
WAN3 149.102.97.128 VMAC address: 00:1C:7F:00:71:08

We found that the PING packet loss occurs whenever HA is formed, and the more interfaces involved in HA, the more PING loss. We checked cables, switch ports, etc., but the problem still exists.

You mentioned an issue with G-ARP in this version; is there a reference or link I can refer to? Thanks.

Andy1977 · 2024-01-29

Some updates on the issue.

I traveled to the client's office to look at the issue on site. After running a number of PINGs across different network segments, I saw that most of the PING loss occurs internally, between the end-user PCs and the Check Point firewalls, and ifconfig -a shows that LAN1 (the internal network) has an RX packet drop rate of 0.56%.
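An RX drop rate like this is a ratio of interface counters. The Python sketch below computes it from one interface's ifconfig text, handling both the older "RX packets:N ... dropped:D" and the newer "RX packets N" / "RX errors E  dropped D" layouts. Treating the rate as dropped / (received + dropped) is an assumption; how drivers populate these counters can vary, and the thread does not say how the 0.56% figure was calculated.

```python
import re


def rx_drop_rate(ifconfig_block):
    """Percent of inbound packets dropped, as dropped / (packets + dropped).

    Expects the ifconfig text of a single interface (older or newer Linux layout).
    """
    packets = re.search(r"RX packets[:\s]+(\d+)", ifconfig_block)
    # "dropped" sits on the "RX packets:" line (old) or the "RX errors" line (new).
    dropped = re.search(r"RX (?:packets|errors)[^\n]*?dropped[:\s]+(\d+)", ifconfig_block)
    if not packets or not dropped:
        raise ValueError("RX counters not found")
    rx, drop = int(packets.group(1)), int(dropped.group(1))
    total = rx + drop
    return 0.0 if total == 0 else 100.0 * drop / total


if __name__ == "__main__":
    # Hypothetical counters, chosen only to illustrate the calculation.
    sample = "RX packets:99440 errors:0 dropped:560 overruns:0 frame:0\n" \
             "TX packets:5000 errors:0 dropped:0 overruns:0 carrier:0\n"
    print(f"RX drop rate: {rx_drop_rate(sample):.2f}%")
```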

The client then told me their network flow is:

Endpoint PC > Core Switch > Huawei Firewall (standalone) > Check Point Firewall (Cluster) > switch > ISP router

I saw that the client connected the LAN1 ports of Check Point Firewall A and B directly to two ports on the Huawei firewall. Is this feasible? Shouldn't there be a switch between the Check Point firewalls and the Huawei firewall? Could it be that the Huawei firewall cannot handle the Check Point cluster VIP properly when connected directly?

AmirArama · 2024-01-30

Have you checked the performance (CPU) on the machine?

RX drops often happen because there is no CPU available to empty the NIC buffer in time.

*G-ARP should not cause random packet loss; it mainly causes outages during cluster failover.

Andy1977 · 2024-01-30

Yes, I have checked the CPU performance. Usage is low on all cores, and the peak active connections per CPU is only 250-280.

K_R_V · 2024-01-30

Hello,

If you're facing the same problem: I dealt with it for a few months and finally found a solution with TAC, which is implemented in R81.10.08 build 1690. To get the fix, contact TAC. I suggest asking for R81.10.08 build 1711; it works well.

https://community.checkpoint.com/t5/SMB-Gateways-Spark/R81-10-07-and-08-Centrally-managed-ClusterXL-...

Andy1977 · 2024-01-30

I saw your post before posting mine. My client's current version is 1595 Appliance R81.10.08 - Build 683, and this is already the latest firmware available, whether downloaded or installed by clicking update on the appliance itself. So do I need to open a TAC case separately to get a special version? Thanks.

K_R_V · 2024-01-30

Hello,

Indeed, the latest public version is 683, but this VMAC issue is fixed from build 690 onward. It's best to open a TAC case and ask for build 711, as we had another issue in build 690 that seems to be fixed in 711.

To find out whether it is the same issue, run tcpdump on the standby firewall while doing a continuous ping. All pings with no reply arrive on the standby firewall!

Andy1977 · 2024-01-30

Greatly appreciate your help.

So I run a continuous PING from an endpoint PC to an external address, say 8.8.8.8, then run tcpdump on both firewalls to see where the missing PINGs go. And if all the missing PINGs show up on the standby firewall, then it's the same issue as yours? Any recommended tcpdump command?

K_R_V · 2024-01-30

Indeed, that is exactly how to check whether this is the same issue.

In expert mode, run: tcpdump -ni interfacename host 8.8.8.8
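Comparing the two captures by hand gets tedious over 1000 PINGs. Below is a minimal Python sketch, assuming the text output of the tcpdump command above was saved from each member (for example by redirecting it to a file) and that the lines follow tcpdump's usual "ICMP echo request/reply, id N, seq N" form. Matching on the ICMP sequence number alone is a simplification: NAT may rewrite the ICMP id, and sequence numbers can repeat across separate ping runs. The function names are hypothetical.

```python
import re

# tcpdump -n lines look like:
# "10:00:01.100000 IP 8.8.8.8 > 202.175.116.210: ICMP echo reply, id 1, seq 7, length 40"
ICMP_LINE = re.compile(r"ICMP echo (?P<kind>request|reply), id \d+, seq (?P<seq>\d+)")


def icmp_seqs(tcpdump_text, kind):
    """Set of ICMP sequence numbers of the given kind ('request' or 'reply')."""
    return {int(m.group("seq")) for m in ICMP_LINE.finditer(tcpdump_text)
            if m.group("kind") == kind}


def locate_missing_replies(active_text, standby_text):
    """For requests sent via the active member without a reply there, check the standby."""
    requests = icmp_seqs(active_text, "request")
    missing = requests - icmp_seqs(active_text, "reply")
    on_standby = icmp_seqs(standby_text, "reply")
    return {
        "missing_on_active": sorted(missing),
        "seen_on_standby": sorted(missing & on_standby),
        "not_seen_anywhere": sorted(missing - on_standby),
    }


if __name__ == "__main__":
    import sys
    # Usage: python compare.py active.txt standby.txt
    with open(sys.argv[1]) as a, open(sys.argv[2]) as s:
        print(locate_missing_replies(a.read(), s.read()))
```

If every sequence number in missing_on_active also appears in seen_on_standby, the replies are reaching the standby member, which matches the symptom K_R_V describes.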

I would open a case anyway and upgrade; the 1711 build has other fixes you want that are not yet publicly documented (based on my experience with many TAC cases).

Andy1977 · 2024-01-30

Big thanks for your suggestion. I will do the test and hopefully prove it's the same issue.

Andy1977 · 2024-02-02

Today I went to the client's site and performed the test. The lost PING packets show up on the standby firewall, so I think we are hitting the same issue. Thanks again for your advice.

K_R_V · 2024-02-03

I'm pleased to see that my three-month effort with TAC, including the frustrations, escalations, and lab setup to validate my point, has benefited you!

It's a bit sad that this solution isn't yet documented or incorporated into the latest public release, given that I got a fixed version two months ago.

FYI, this problem also exists in R81.10 for non-Spark devices, but it is already fixed in Jumbo 128 (https://support.checkpoint.com/results/sk/sk181645).
