While there are a few existing threads discussing G-ARP, the solutions provided there do not seem to work for this situation.
I also think this scenario is encountered often enough to warrant its own thread.
The scenario is a pending HA cluster hardware swap. The goal is to avoid the 4-hour ARP cache expiration problem.
arping does not work for vMAC.
Neither, it seems, does "fw ctl set int test_arp_refresh 1".
Tested as follows (public IPs are fake, R81.10):
Expected G-ARP capture on the connected router, triggered by "arping -c 4 -A -I eth4 200.100.0.2" run from one of the cluster members:
root@router:/home/vyos# tcpdump -ni eth1 -c4 broadcast and arp and arp[6:2] == 2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes
12:52:09.345639 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 46
12:52:10.345408 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 46
12:52:11.346263 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 46
12:52:12.346336 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 46
4 packets captured
4 packets received by filter
0 packets dropped by kernel
root@router:/home/vyos#
The same capture on the cluster member's interface facing the router (identical on both members):
[Expert@CPCM1:0]# tcpdump -ni eth4 -c4 broadcast and arp and arp[6:2] == 2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth4, link-type EN10MB (Ethernet), capture size 262144 bytes
12:53:30.011708 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 28
12:53:31.012288 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 28
12:53:32.013320 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 28
12:53:33.013524 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 28
4 packets captured
4 packets received by filter
0 packets dropped by kernel
[Expert@CPCM1:0]#
I presume we should see the same, but with the vMAC, when using "fw ctl set int test_arp_refresh 1".
But when we run it:
[Expert@CPCM1:0]# fw ctl set int test_arp_refresh 1
[Expert@CPCM1:0]#
the capture shows nothing:
[Expert@CPCM1:0]# tcpdump -ni eth4 -c4 broadcast and arp and arp[6:2] == 2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth4, link-type EN10MB (Ethernet), capture size 262144 bytes
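One caveat about the capture filter used above: arp[6:2] is the ARP opcode field, and the value 2 matches ARP replies only. A gratuitous ARP can also be sent as a request (opcode 1), and such a packet would not show up in these captures. Whether ClusterXL sends its G-ARPs as requests or replies is not established here, so the following is only a minimal Python sketch of how the fields decode. The function and variable names are hypothetical, and the sample packet is built by hand from the vMAC and cluster IP shown in this post.

```python
import struct

ARP_REQUEST = 1
ARP_REPLY = 2


def classify_arp(arp_payload: bytes) -> dict:
    """Decode an IPv4-over-Ethernet ARP payload (without the Ethernet header)."""
    if len(arp_payload) < 28:
        raise ValueError("an IPv4-over-Ethernet ARP payload is 28 bytes long")
    # Bytes 0-7: hardware type, protocol type, address lengths, opcode.
    # The opcode sits at offset 6, which is what arp[6:2] reads in tcpdump.
    _htype, _ptype, _hlen, _plen, opcode = struct.unpack("!HHBBH", arp_payload[:8])
    sender_mac = arp_payload[8:14]
    sender_ip = arp_payload[14:18]
    target_ip = arp_payload[24:28]
    return {
        "opcode": opcode,
        "sender_mac": sender_mac.hex(":"),
        # A gratuitous ARP announces its own address: sender IP == target IP.
        "gratuitous": sender_ip == target_ip,
        # True only if the filter "arp[6:2] == 2" would match this packet.
        "matches_filter": opcode == ARP_REPLY,
    }


def sample_garp(opcode: int) -> bytes:
    """Hand-built G-ARP payload for the cluster IP and vMAC from this post."""
    vmac = bytes.fromhex("001c7f003361")
    vip = bytes([200, 100, 0, 1])
    header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, opcode)
    return header + vmac + vip + b"\x00" * 6 + vip


as_request = classify_arp(sample_garp(ARP_REQUEST))
as_reply = classify_arp(sample_garp(ARP_REPLY))
print(as_request)
print(as_reply)
```

Both sample packets are gratuitous, but only the reply form satisfies the filter. Dropping the opcode condition from the tcpdump expression (for example, filtering on "broadcast and arp" alone) would capture both forms.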
Cluster state on the active member before executing the failover:
[Expert@CPCM1:0]# cphaprob stat
Cluster Mode: High Availability (Active Up) with IGMP Membership
ID Unique Address Assigned Load State Name
1 (local) 192.168.255.2 100% ACTIVE CPCM1
2 192.168.255.3 0% DOWN CPCM2
Active PNOTEs: None
Last member state change event:
Event Code: CLUS-114904
State change: ACTIVE(!) -> ACTIVE
Reason for state change: Reason for ACTIVE! alert has been resolved
Event time: Wed Aug 11 12:26:42 2021
Last cluster failover event:
Transition to new ACTIVE: Member 2 -> Member 1
Reason: ADMIN_DOWN PNOTE
Event time: Wed Aug 11 11:21:20 2021
Cluster failover count:
Failover counter: 2
Time of counter reset: Wed Aug 11 10:29:31 2021 (reboot)
[Expert@CPCM1:0]#
With vMAC configured:
[Expert@CPCM1:0]# cphaprob -a if
CCP mode: Manual (Unicast)
Required interfaces: 3
Required secured interfaces: 1
Interface Name: Status:
eth0 UP
eth3 (S) UP
eth4 UP
S - sync, LM - link monitor, HA/LS - bond type
Virtual cluster interfaces: 2
eth0 10.0.0.1 VMAC address: 00:1C:7F:00:33:61
eth4 200.100.0.1 VMAC address: 00:1C:7F:00:33:61
[Expert@CPCM1:0]#
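For reference, this is roughly what a G-ARP carrying the vMAC would look like on the wire. The sketch below is Python, untested against ClusterXL, and assumes a Linux host with root privileges and raw AF_PACKET sockets. The function names are hypothetical, and setting the target hardware address to broadcast is just one common convention. Whether injecting such frames from a Gaia member is supported is not something this post establishes.

```python
import socket
import struct
import time


def mac_to_bytes(mac: str) -> bytes:
    """Convert '00:1C:7F:00:33:61' (or dash-separated) to 6 raw bytes."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError("a MAC address must be 6 bytes")
    return raw


def build_garp_reply(vmac: str, vip: str) -> bytes:
    """Build a broadcast Ethernet frame holding a gratuitous ARP reply."""
    src = mac_to_bytes(vmac)
    ip = socket.inet_aton(vip)
    broadcast = b"\xff" * 6
    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    eth_header = broadcast + src + struct.pack("!H", 0x0806)
    # ARP: Ethernet/IPv4, opcode 2 (reply), sender IP == target IP.
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2) + src + ip + broadcast + ip
    return eth_header + arp


def send_garp(ifname: str, frame: bytes, count: int = 4) -> None:
    """Send the frame 'count' times, one per second. Linux only, needs root."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((ifname, 0))
        for _ in range(count):
            sock.send(frame)
            time.sleep(1)


# Values taken from the "cphaprob -a if" output above.
frame = build_garp_reply("00:1C:7F:00:33:61", "200.100.0.1")
print(len(frame), frame.hex())
```

Calling send_garp("eth4", frame) would put the frame on the wire. It is deliberately not invoked here, since it needs root and a live interface.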
Weirder yet, I am not seeing G-ARP on either cluster member during a successful failover:
[Expert@CPCM1:0]# clusterXL_admin down
This command does not survive reboot. To make the change permanent, run either the 'set cluster member admin {down|up} permanent' command in Gaia Clish, or the 'clusterXL_admin {down|up} -p' command in Expert mode
Setting member to administratively down state ...
Member current state is DOWN
[Expert@CPCM1:0]#
We are not seeing G-ARP packets on either cluster member (contrary to what was implied in the CheckMates thread https://community.checkpoint.com/t5/Security-Gateways/How-to-send-G-ARP-manually/m-p/69914):
[Expert@CPCM1:0]# tcpdump -ni eth4 -c4 broadcast and arp and arp[6:2] == 2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth4, link-type EN10MB (Ethernet), capture size 262144 bytes
[Expert@CPCM2:0]# tcpdump -ni eth4 -c4 broadcast and arp and arp[6:2] == 2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth4, link-type EN10MB (Ethernet), capture size 262144 bytes
Meanwhile, the actual failover completes successfully:
[Expert@CPCM1:0]# cphaprob stat
Cluster Mode: High Availability (Active Up) with IGMP Membership
ID Unique Address Assigned Load State Name
1 (local) 192.168.255.2 0% DOWN CPCM1
2 192.168.255.3 100% ACTIVE CPCM2
Active PNOTEs: ADMIN
Last member state change event:
Event Code: CLUS-111400
State change: ACTIVE -> DOWN
Reason for state change: ADMIN_DOWN PNOTE
Event time: Wed Aug 11 13:08:48 2021
Last cluster failover event:
Transition to new ACTIVE: Member 1 -> Member 2
Reason: ADMIN_DOWN PNOTE
Event time: Wed Aug 11 13:08:47 2021
Cluster failover count:
Failover counter: 3
Time of counter reset: Wed Aug 11 10:29:31 2021 (reboot)
[Expert@CPCM1:0]#
[Expert@CPCM2:0]# cphaprob stat
Cluster Mode: High Availability (Active Up) with IGMP Membership
ID Unique Address Assigned Load State Name
1 192.168.255.2 0% DOWN CPCM1
2 (local) 192.168.255.3 100% ACTIVE CPCM2
Active PNOTEs: None
Last member state change event:
Event Code: CLUS-114704
State change: STANDBY -> ACTIVE
Reason for state change: No other ACTIVE members have been found in the cluster
Event time: Wed Aug 11 13:08:47 2021
Last cluster failover event:
Transition to new ACTIVE: Member 1 -> Member 2
Reason: ADMIN_DOWN PNOTE
Event time: Wed Aug 11 13:08:47 2021
Cluster failover count:
Failover counter: 3
Time of counter reset: Wed Aug 11 10:29:31 2021 (reboot)
[Expert@CPCM2:0]#
It would be great to hear from someone who has tackled this issue successfully in the field.
Thank you,
Vladimir