Forcing Gratuitous ARP (G-ARP) from ClusterXL with vMAC

Vladimir · 2021-08-11

While there are a few existing threads discussing G-ARP, the solutions offered in them do not seem to work in this situation.

I also think this scenario comes up often enough to warrant its own thread.

The scenario is a pending HA cluster hardware swap. The goal is to avoid the 4-hour ARP cache expiration problem.

arping does not work for vMAC.

Nor, it seems, does "fw ctl set int test_arp_refresh 1".

Tested as follows on R81.10 (public IPs are fake):

Expected G-ARP capture on the connected router, triggered by “arping -c 4 -A -I eth4 200.100.0.2” run on one of the cluster members:

root@router:/home/vyos# tcpdump -ni eth1 -c4 broadcast and arp and arp[6:2] == 2

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes

12:52:09.345639 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 46

12:52:10.345408 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 46

12:52:11.346263 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 46

12:52:12.346336 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 46

4 packets captured

4 packets received by filter

0 packets dropped by kernel

root@router:/home/vyos#
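As a reading aid for the capture filter used throughout this post: `arp[6:2] == 2` compares the two bytes at offset 6 of the ARP header, which is the opcode field, against 2 (reply). The sketch below decodes a 28-byte ARP payload and checks both conditions separately. The sample payload is reconstructed from the capture line above; tcpdump does not print the target MAC there, so the broadcast value used for it is an assumption.

```python
import socket
import struct

ARP_REPLY = 2  # opcode matched by the tcpdump filter "arp[6:2] == 2"


def parse_arp(payload: bytes) -> dict:
    """Decode the fixed 28-byte ARP payload for IPv4 over Ethernet."""
    # Bytes 0-7: hardware type, protocol type, address lengths, opcode.
    htype, ptype, hlen, plen, oper = struct.unpack("!HHBBH", payload[:8])
    # Bytes 8-27: sender MAC/IP followed by target MAC/IP.
    sha, spa, tha, tpa = struct.unpack("!6s4s6s4s", payload[8:28])
    return {
        "oper": oper,
        "sender_mac": sha.hex(":"),
        "sender_ip": socket.inet_ntoa(spa),
        "target_mac": tha.hex(":"),
        "target_ip": socket.inet_ntoa(tpa),
    }


def is_gratuitous(arp: dict) -> bool:
    """A gratuitous ARP announces the sender's own IP (sender IP == target IP)."""
    return arp["sender_ip"] == arp["target_ip"]


def matches_reply_filter(arp: dict) -> bool:
    """True when the opcode is 2, i.e. what "arp[6:2] == 2" selects."""
    return arp["oper"] == ARP_REPLY


# Sample rebuilt from "ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39".
# The target MAC is NOT visible in the capture; broadcast is assumed here.
sample = struct.pack(
    "!HHBBH6s4s6s4s",
    1, 0x0800, 6, 4, ARP_REPLY,
    bytes.fromhex("080027f50e39"), socket.inet_aton("200.100.0.2"),
    b"\xff" * 6, socket.inet_aton("200.100.0.2"),
)
decoded = parse_arp(sample)
print(decoded, is_gratuitous(decoded), matches_reply_filter(decoded))
```

One consequence worth keeping in mind: a gratuitous ARP can also be sent as a request (opcode 1), and such a frame would pass `is_gratuitous` but not this capture filter. Whether ClusterXL uses that form is not established in this post.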

The same capture on the cluster member's interface connected to the router (identical on both members):

[Expert@CPCM1:0]# tcpdump -ni eth4 -c4 broadcast and arp and arp[6:2] == 2

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on eth4, link-type EN10MB (Ethernet), capture size 262144 bytes

12:53:30.011708 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 28

12:53:31.012288 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 28

12:53:32.013320 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 28

12:53:33.013524 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 28

4 packets captured

4 packets received by filter

0 packets dropped by kernel

[Expert@CPCM1:0]#

I presume we should see the same, but with the vMAC, when using "fw ctl set int test_arp_refresh 1".

But when we run it:

[Expert@CPCM1:0]# fw ctl set int test_arp_refresh 1

[Expert@CPCM1:0]#

We are not seeing anything:

[Expert@CPCM1:0]# tcpdump -ni eth4 -c4 broadcast and arp and arp[6:2] == 2

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on eth4, link-type EN10MB (Ethernet), capture size 262144 bytes
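For comparison, here is a minimal sketch of the frame one would hope to see in this capture: a broadcast gratuitous ARP reply for the cluster VIP with the vMAC as sender. The VIP (200.100.0.1) and vMAC (00:1C:7F:00:33:61) are the eth4 values from the `cphaprob -a if` output in this post. Using broadcast as the target hardware address is an assumption, and `send_frame` is a hypothetical helper (Linux only, needs root) that is defined but not called. This is not a Check Point supported procedure, and its effect on a live cluster is untested.

```python
import socket
import struct

BROADCAST = b"\xff" * 6
ETH_P_ARP = 0x0806  # EtherType for ARP


def mac_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' to 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_garp_reply(mac: str, ip: str) -> bytes:
    """Build an unpadded Ethernet frame carrying a gratuitous ARP reply."""
    sha = mac_bytes(mac)
    spa = socket.inet_aton(ip)
    # Ethernet header: broadcast destination, source = advertised MAC.
    eth = BROADCAST + sha + struct.pack("!H", ETH_P_ARP)
    # ARP: Ethernet/IPv4, opcode 2 (reply), sender IP == target IP.
    # Target hardware address set to broadcast (an assumption).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2) + sha + spa + BROADCAST + spa
    return eth + arp


def pad_to_minimum(frame: bytes, minimum: int = 60) -> bytes:
    """Zero-pad to the minimum Ethernet frame size (excluding the FCS)."""
    return frame.ljust(minimum, b"\x00")


def send_frame(ifname: str, frame: bytes) -> None:
    """Hypothetical sender: Linux AF_PACKET raw socket, requires root."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as s:
        s.bind((ifname, 0))
        s.send(frame)


frame = build_garp_reply("00:1C:7F:00:33:61", "200.100.0.1")
print(len(frame) - 14)                  # ARP payload size before padding
print(len(pad_to_minimum(frame)) - 14)  # payload size after padding
```

The two printed sizes line up with the captures above: the ARP payload itself is 28 bytes, which is what the cluster member reports, while the `length 46` seen on the router is consistent with the frame having been padded to the 60-byte Ethernet minimum (60 - 14 = 46).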

Next, a failover initiated on the active cluster member. Its state beforehand:

[Expert@CPCM1:0]# cphaprob stat

Cluster Mode: High Availability (Active Up) with IGMP Membership

ID Unique Address Assigned Load State Name

1 (local) 192.168.255.2 100% ACTIVE CPCM1

2 192.168.255.3 0% DOWN CPCM2

Active PNOTEs: None

Last member state change event:

Event Code: CLUS-114904

State change: ACTIVE(!) -> ACTIVE

Reason for state change: Reason for ACTIVE! alert has been resolved

Event time: Wed Aug 11 12:26:42 2021

Last cluster failover event:

Transition to new ACTIVE: Member 2 -> Member 1

Reason: ADMIN_DOWN PNOTE

Event time: Wed Aug 11 11:21:20 2021

Cluster failover count:

Failover counter: 2

Time of counter reset: Wed Aug 11 10:29:31 2021 (reboot)

[Expert@CPCM1:0]#

With vMAC configured:

[Expert@CPCM1:0]# cphaprob -a if

CCP mode: Manual (Unicast)

Required interfaces: 3

Required secured interfaces: 1

Interface Name: Status:

eth0 UP

eth3 (S) UP

eth4 UP

S - sync, LM - link monitor, HA/LS - bond type

Virtual cluster interfaces: 2

eth0 10.0.0.1 VMAC address: 00:1C:7F:00:33:61

eth4 200.100.0.1 VMAC address: 00:1C:7F:00:33:61

[Expert@CPCM1:0]#
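When comparing captures against the configured vMAC, it can help to pull the VIP-to-vMAC mapping out of this output programmatically. The sketch below assumes the line layout shown above ("interface, IP, `VMAC address:`, MAC"); the regular expression and the `parse_vmacs` name are illustrative, not part of any Check Point tooling, and other versions may format the output differently.

```python
import re

# Matches lines such as: "eth4 200.100.0.1 VMAC address: 00:1C:7F:00:33:61"
VMAC_LINE = re.compile(
    r"^(?P<ifname>\S+)\s+"
    r"(?P<vip>\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"VMAC address:\s+"
    r"(?P<vmac>(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2})$"
)


def parse_vmacs(output: str) -> dict:
    """Return {interface: (virtual IP, vMAC in lowercase)} from 'cphaprob -a if'."""
    result = {}
    for line in output.splitlines():
        match = VMAC_LINE.match(line.strip())
        if match:
            result[match["ifname"]] = (match["vip"], match["vmac"].lower())
    return result


# Lines copied from the output above.
SAMPLE = """\
Virtual cluster interfaces: 2
eth0 10.0.0.1 VMAC address: 00:1C:7F:00:33:61
eth4 200.100.0.1 VMAC address: 00:1C:7F:00:33:61
"""

for ifname, (vip, vmac) in parse_vmacs(SAMPLE).items():
    print(ifname, vip, vmac)
```

The lowercase MAC matches the form tcpdump prints, so the value can be compared directly with the `is-at` field in a capture.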

Weirder still, I am not seeing G-ARP on either cluster member during a successful failover:

[Expert@CPCM1:0]# clusterXL_admin down

This command does not survive reboot. To make the change permanent, run either the 'set cluster member admin {down|up} permanent' command in Gaia Clish, or the 'clusterXL_admin {down|up} -p' command in Expert mode

Setting member to administratively down state ...

Member current state is DOWN

[Expert@CPCM1:0]#

We are not seeing G-ARP packets on either cluster member (contrary to what was implied in the CheckMates thread https://community.checkpoint.com/t5/Security-Gateways/How-to-send-G-ARP-manually/m-p/69914):

[Expert@CPCM1:0]# tcpdump -ni eth4 -c4 broadcast and arp and arp[6:2] == 2

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on eth4, link-type EN10MB (Ethernet), capture size 262144 bytes

[Expert@CPCM2:0]# tcpdump -ni eth4 -c4 broadcast and arp and arp[6:2] == 2

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on eth4, link-type EN10MB (Ethernet), capture size 262144 bytes

Meanwhile, the failover itself completes successfully:

[Expert@CPCM1:0]# cphaprob stat

Cluster Mode: High Availability (Active Up) with IGMP Membership

ID Unique Address Assigned Load State Name

1 (local) 192.168.255.2 0% DOWN CPCM1

2 192.168.255.3 100% ACTIVE CPCM2

Active PNOTEs: ADMIN

Last member state change event:

Event Code: CLUS-111400

State change: ACTIVE -> DOWN

Reason for state change: ADMIN_DOWN PNOTE

Event time: Wed Aug 11 13:08:48 2021

Last cluster failover event:

Transition to new ACTIVE: Member 1 -> Member 2

Reason: ADMIN_DOWN PNOTE

Event time: Wed Aug 11 13:08:47 2021

Cluster failover count:

Failover counter: 3

Time of counter reset: Wed Aug 11 10:29:31 2021 (reboot)

[Expert@CPCM1:0]#

[Expert@CPCM2:0]# cphaprob stat

Cluster Mode: High Availability (Active Up) with IGMP Membership

ID Unique Address Assigned Load State Name

1 192.168.255.2 0% DOWN CPCM1

2 (local) 192.168.255.3 100% ACTIVE CPCM2

Active PNOTEs: None

Last member state change event:

Event Code: CLUS-114704

State change: STANDBY -> ACTIVE

Reason for state change: No other ACTIVE members have been found in the cluster

Event time: Wed Aug 11 13:08:47 2021

Last cluster failover event:

Transition to new ACTIVE: Member 1 -> Member 2

Reason: ADMIN_DOWN PNOTE

Event time: Wed Aug 11 13:08:47 2021

Cluster failover count:

Failover counter: 3

Time of counter reset: Wed Aug 11 10:29:31 2021 (reboot)

[Expert@CPCM2:0]#
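To line up capture timing with the moment of failover, the member table from `cphaprob stat` can be checked in a script rather than by eye. The sketch below assumes the row layout shown above (ID, optional "(local)", address, load, state, name); the `parse_members` and `active_member` names are illustrative only.

```python
import re

# Matches rows such as: "1 (local) 192.168.255.2 0% DOWN CPCM1"
MEMBER_LINE = re.compile(
    r"^(?P<id>\d+)\s+"
    r"(?P<local>\(local\)\s+)?"
    r"(?P<addr>\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"(?P<load>\d+)%\s+"
    r"(?P<state>\S+)\s+"
    r"(?P<name>\S+)$"
)


def parse_members(output: str) -> list:
    """Extract the member rows from 'cphaprob stat' output."""
    members = []
    for line in output.splitlines():
        match = MEMBER_LINE.match(line.strip())
        if match:
            members.append({
                "id": int(match["id"]),
                "local": match["local"] is not None,
                "address": match["addr"],
                "load": int(match["load"]),
                "state": match["state"],
                "name": match["name"],
            })
    return members


def active_member(members: list):
    """Return the name of the first member whose state starts with ACTIVE, else None."""
    for member in members:
        if member["state"].startswith("ACTIVE"):
            return member["name"]
    return None


# Rows copied from the CPCM1 output after the failover.
SAMPLE = """\
ID Unique Address Assigned Load State Name
1 (local) 192.168.255.2 0% DOWN CPCM1
2 192.168.255.3 100% ACTIVE CPCM2
"""

print(active_member(parse_members(SAMPLE)))
```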

It would be great to hear from anyone who has tackled this issue successfully in the field.

Thank you,

Vladimir
