Forcing Gratuitous ARP (G-ARP) from ClusterXL with vMAC

Vladimir · 2021-08-11

While there are a few existing threads discussing G-ARP, the solutions provided there do not seem to work for this situation.

I also think that this scenario is encountered often enough to warrant its own thread.

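The scenario is a pending HA cluster hardware swap. The goal is to avoid the 4-hour ARP cache expiration problem, where connected routers keep using stale MAC addresses for the cluster IPs until their ARP entries time out.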
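To make the problem concrete, here is a toy Python model of a router's ARP cache (not any vendor's implementation): an entry is replaced only when a new ARP message for that IP arrives, and otherwise lives until the timeout, taken here as the 4 hours mentioned above. The MAC addresses are hypothetical placeholders.

```python
from typing import Dict, Optional, Tuple

# Lifetime from the post (4 hours); real routers make this configurable.
ARP_TIMEOUT_S = 4 * 60 * 60


class ArpCache:
    """Toy ARP cache: ip -> (mac, time learned), with a fixed lifetime."""

    def __init__(self, timeout: float = ARP_TIMEOUT_S) -> None:
        self.timeout = timeout
        self.entries: Dict[str, Tuple[str, float]] = {}

    def learn(self, ip: str, mac: str, now: float) -> None:
        # In this model, any ARP reply for the IP (gratuitous or not) overwrites it.
        self.entries[ip] = (mac, now)

    def lookup(self, ip: str, now: float) -> Optional[str]:
        entry = self.entries.get(ip)
        if entry is None:
            return None
        mac, learned_at = entry
        if now - learned_at >= self.timeout:
            del self.entries[ip]  # aged out; the router would have to ARP again
            return None
        return mac


OLD_MAC = "02:00:00:00:00:01"  # hypothetical MAC of the old hardware
NEW_MAC = "02:00:00:00:00:02"  # hypothetical MAC of the new hardware
VIP = "200.100.0.1"

cache = ArpCache()
cache.learn(VIP, OLD_MAC, now=0)

# One hour after the swap, without a G-ARP, the router still uses the old MAC.
print(cache.lookup(VIP, now=3600))

# A G-ARP from the new hardware replaces the entry immediately.
cache.learn(VIP, NEW_MAC, now=3601)
print(cache.lookup(VIP, now=3602))
```

Without the G-ARP, the only way out in this model is to wait for "lookup" to age the entry out, which is exactly the 4-hour window the swap is trying to avoid.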

arping does not work for vMAC: as the captures below show, it announces the member's physical MAC address, not the vMAC.

Nor, it seems, does "fw ctl set int test_arp_refresh 1".

Tested as follows (public IPs are fake, R81.10):

Expected G-ARP packet capture on the connected router, triggered by running “arping -c 4 -A -I eth4 200.100.0.2” on one of the cluster members:

root@router:/home/vyos# tcpdump -ni eth1 -c4 broadcast and arp and arp[6:2] == 2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes
12:52:09.345639 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 46
12:52:10.345408 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 46
12:52:11.346263 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 46
12:52:12.346336 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 46
4 packets captured
4 packets received by filter
0 packets dropped by kernel
root@router:/home/vyos#
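The capture filter used throughout, "broadcast and arp and arp[6:2] == 2", keeps only Ethernet-broadcast ARP frames whose two-byte opcode field, at offset 6 of the ARP header, equals 2 (reply). A stdlib-only Python sketch of the same test on raw Ethernet frames, assuming untagged frames (no 802.1Q header); the helper "make_frame" and the unicast MAC in the examples are made up for illustration:

```python
import struct

BROADCAST = b"\xff" * 6
ETH_P_ARP = 0x0806
ETH_HDR_LEN = 14


def matches_capture_filter(frame: bytes) -> bool:
    """Rough equivalent of: broadcast and arp and arp[6:2] == 2."""
    if len(frame) < ETH_HDR_LEN + 8:
        return False
    dst = frame[0:6]
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if dst != BROADCAST or ethertype != ETH_P_ARP:
        return False
    # arp[6:2]: the two bytes at offset 6 of the ARP header hold the opcode.
    (opcode,) = struct.unpack("!H", frame[ETH_HDR_LEN + 6:ETH_HDR_LEN + 8])
    return opcode == 2


def make_frame(dst: bytes, ethertype: int, opcode: int) -> bytes:
    """Hypothetical helper: Ethernet header + ARP fixed fields, rest zeroed."""
    src = bytes.fromhex("080027f50e39")  # member MAC seen in the captures
    arp_fixed = struct.pack("!HHBBH", 1, 0x0800, 6, 4, opcode)
    return dst + src + struct.pack("!H", ethertype) + arp_fixed + bytes(20)


unicast = bytes.fromhex("020000000001")
print(matches_capture_filter(make_frame(BROADCAST, ETH_P_ARP, 2)))  # True
print(matches_capture_filter(make_frame(unicast, ETH_P_ARP, 2)))    # False: unicast
print(matches_capture_filter(make_frame(BROADCAST, 0x0800, 2)))     # False: not ARP
```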

The same capture on the cluster member's interface connected to the router (identical on both members):

[Expert@CPCM1:0]# tcpdump -ni eth4 -c4 broadcast and arp and arp[6:2] == 2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth4, link-type EN10MB (Ethernet), capture size 262144 bytes
12:53:30.011708 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 28
12:53:31.012288 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 28
12:53:32.013320 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 28
12:53:33.013524 ARP, Reply 200.100.0.2 is-at 08:00:27:f5:0e:39, length 28
4 packets captured
4 packets received by filter
0 packets dropped by kernel
[Expert@CPCM1:0]#
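The router and member captures differ only in the reported length: 46 versus 28. The ARP payload itself is 28 bytes; frames shorter than the 60-byte Ethernet minimum (excluding FCS) are padded on the wire, so the receiver sees 60 - 14 = 46 bytes after the Ethernet header, while a capture taken on the sender before padding typically shows 28. A stdlib-only Python sketch that decodes an ARP payload into a tcpdump-style line (replies only); the target hardware address in the sample packet is an assumption, since the captures do not show it:

```python
import socket
import struct


def describe_arp(payload: bytes) -> str:
    """Summarize an Ethernet/IPv4 ARP payload in tcpdump's style (replies only).

    Only the first 28 bytes are parsed; trailing Ethernet padding is ignored
    but still counted in the reported length.
    """
    htype, ptype, hlen, plen, oper = struct.unpack("!HHBBH", payload[:8])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        raise ValueError("not an Ethernet/IPv4 ARP packet")
    sender_mac = ":".join(f"{b:02x}" for b in payload[8:14])
    sender_ip = socket.inet_ntoa(payload[14:18])
    if oper != 2:
        return f"ARP, opcode {oper}, length {len(payload)}"
    return f"ARP, Reply {sender_ip} is-at {sender_mac}, length {len(payload)}"


# Rebuild the G-ARP reply seen in the captures: sender IP == target IP.
mac = bytes.fromhex("080027f50e39")
ip = socket.inet_aton("200.100.0.2")
tha = b"\xff" * 6  # assumption: target MAC is not visible in the captures
arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2) + mac + ip + tha + ip

print(describe_arp(arp))              # as captured on the sending member
print(describe_arp(arp + bytes(18)))  # as captured on the router (padded)
```

The two calls print the same line as the captures, ending in "length 28" and "length 46" respectively.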

I presume we should see the same thing, but with the vMAC, when using "fw ctl set int test_arp_refresh 1".

But when we run it:

[Expert@CPCM1:0]# fw ctl set int test_arp_refresh 1
[Expert@CPCM1:0]#

we see nothing:

[Expert@CPCM1:0]# tcpdump -ni eth4 -c4 broadcast and arp and arp[6:2] == 2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth4, link-type EN10MB (Ethernet), capture size 262144 bytes

The command was executed on the active cluster member:

[Expert@CPCM1:0]# cphaprob stat

Cluster Mode:   High Availability (Active Up) with IGMP Membership

ID         Unique Address  Assigned Load   State          Name

1 (local)  192.168.255.2   100%            ACTIVE         CPCM1
2          192.168.255.3   0%              DOWN           CPCM2

Active PNOTEs: None

Last member state change event:
   Event Code:                 CLUS-114904
   State change:               ACTIVE(!) -> ACTIVE
   Reason for state change:    Reason for ACTIVE! alert has been resolved
   Event time:                 Wed Aug 11 12:26:42 2021

Last cluster failover event:
   Transition to new ACTIVE:   Member 2 -> Member 1
   Reason:                     ADMIN_DOWN PNOTE
   Event time:                 Wed Aug 11 11:21:20 2021

Cluster failover count:
   Failover counter:           2
   Time of counter reset:      Wed Aug 11 10:29:31 2021 (reboot)

[Expert@CPCM1:0]#

With vMAC configured:

[Expert@CPCM1:0]# cphaprob -a if

CCP mode: Manual (Unicast)
Required interfaces: 3
Required secured interfaces: 1

Interface Name:      Status:

eth0                 UP
eth3 (S)             UP
eth4                 UP

S - sync, LM - link monitor, HA/LS - bond type

Virtual cluster interfaces: 2

eth0         10.0.0.1         VMAC address: 00:1C:7F:00:33:61
eth4         200.100.0.1      VMAC address: 00:1C:7F:00:33:61

[Expert@CPCM1:0]#

Weirder yet, I am not seeing G-ARP on either cluster member even when failing over successfully:

[Expert@CPCM1:0]# clusterXL_admin down
This command does not survive reboot. To make the change permanent, run either the 'set cluster member admin {down|up} permanent' command in Gaia Clish, or the 'clusterXL_admin {down|up} -p' command in Expert mode
Setting member to administratively down state ...
Member current state is DOWN
[Expert@CPCM1:0]#

We are not seeing G-ARP packets on either cluster member (contrary to what was implied in the CheckMates thread https://community.checkpoint.com/t5/Security-Gateways/How-to-send-G-ARP-manually/m-p/69914):

[Expert@CPCM1:0]# tcpdump -ni eth4 -c4 broadcast and arp and arp[6:2] == 2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth4, link-type EN10MB (Ethernet), capture size 262144 bytes

[Expert@CPCM2:0]# tcpdump -ni eth4 -c4 broadcast and arp and arp[6:2] == 2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth4, link-type EN10MB (Ethernet), capture size 262144 bytes

Meanwhile, the actual failover completes successfully:

[Expert@CPCM1:0]# cphaprob stat

Cluster Mode:   High Availability (Active Up) with IGMP Membership

ID         Unique Address  Assigned Load   State          Name

1 (local)  192.168.255.2   0%              DOWN           CPCM1
2          192.168.255.3   100%            ACTIVE         CPCM2

Active PNOTEs: ADMIN

Last member state change event:
   Event Code:                 CLUS-111400
   State change:               ACTIVE -> DOWN
   Reason for state change:    ADMIN_DOWN PNOTE
   Event time:                 Wed Aug 11 13:08:48 2021

Last cluster failover event:
   Transition to new ACTIVE:   Member 1 -> Member 2
   Reason:                     ADMIN_DOWN PNOTE
   Event time:                 Wed Aug 11 13:08:47 2021

Cluster failover count:
   Failover counter:           3
   Time of counter reset:      Wed Aug 11 10:29:31 2021 (reboot)

[Expert@CPCM1:0]#

[Expert@CPCM2:0]# cphaprob stat

Cluster Mode:   High Availability (Active Up) with IGMP Membership

ID         Unique Address  Assigned Load   State          Name

1          192.168.255.2   0%              DOWN           CPCM1
2 (local)  192.168.255.3   100%            ACTIVE         CPCM2

Active PNOTEs: None

Last member state change event:
   Event Code:                 CLUS-114704
   State change:               STANDBY -> ACTIVE
   Reason for state change:    No other ACTIVE members have been found in the cluster
   Event time:                 Wed Aug 11 13:08:47 2021

Last cluster failover event:
   Transition to new ACTIVE:   Member 1 -> Member 2
   Reason:                     ADMIN_DOWN PNOTE
   Event time:                 Wed Aug 11 13:08:47 2021

Cluster failover count:
   Failover counter:           3
   Time of counter reset:      Wed Aug 11 10:29:31 2021 (reboot)

[Expert@CPCM2:0]#
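One thing these captures cannot rule out: the "arp[6:2] == 2" clause matches only ARP replies, while a gratuitous ARP may also be sent as a request (iputils "arping -U" sends unsolicited requests, whereas "-A", used earlier in this thread, sends replies). What makes an ARP packet gratuitous is that its sender and target IP addresses are equal, whatever the opcode, so dropping the opcode clause from the capture filter would show both kinds. This stdlib-only Python sketch classifies ARP payloads on that basis; it does not claim to know which opcode ClusterXL actually uses, and the helper "make_arp" and the non-gratuitous target address are hypothetical:

```python
import socket
import struct

VMAC = "00:1c:7f:00:33:61"  # vMAC from the cphaprob -a if output


def make_arp(oper: int, sender_ip: str, target_ip: str, sender_mac: str = VMAC) -> bytes:
    """Hypothetical helper: 28-byte Ethernet/IPv4 ARP payload."""
    sha = bytes.fromhex(sender_mac.replace(":", ""))
    return (struct.pack("!HHBBH", 1, 0x0800, 6, 4, oper)
            + sha + socket.inet_aton(sender_ip)
            + bytes(6) + socket.inet_aton(target_ip))


def is_gratuitous(payload: bytes) -> bool:
    """G-ARP test: an ARP request or reply whose sender IP equals its target IP."""
    (oper,) = struct.unpack("!H", payload[6:8])
    return oper in (1, 2) and payload[14:18] == payload[24:28]


def passes_opcode_clause(payload: bytes) -> bool:
    """The arp[6:2] == 2 part of the capture filter used in this thread."""
    return struct.unpack("!H", payload[6:8])[0] == 2


samples = {
    "G-ARP request": make_arp(1, "200.100.0.1", "200.100.0.1"),
    "G-ARP reply": make_arp(2, "200.100.0.1", "200.100.0.1"),
    "ordinary request": make_arp(1, "200.100.0.1", "200.100.0.254"),
}
for name, pkt in samples.items():
    print(f"{name}: gratuitous={is_gratuitous(pkt)}, captured={passes_opcode_clause(pkt)}")
```

In this model the G-ARP request is gratuitous but would not pass the opcode clause, which is the case worth excluding before concluding that no G-ARP is sent at all.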

It would be great to hear from someone who has tackled this issue successfully in the field.

Thank you,

Vladimir

Timothy_Hall · 2021-08-12

This is possible using a tool called Scapy, which is not included with Gaia but can be downloaded here: https://scapy.net/

CAUTION: Adding extra Linux software to Gaia is UNSUPPORTED and may result in very bad things. You have been warned.

Scapy uses Python and libpcap, both of which are built into Gaia, so it doesn't need to be compiled with gcc or anything like that. The following was performed on an R81 gateway with Gaia 3.10 in the Shadow Peak lab; here is the Scapy code broken out from the screenshot:

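# Send one G-ARP: an unsolicited ARP reply (op=2) announcing that
# SPOOFIPADDR is at SPOOF_MAC (sender IP == target IP).
from scapy.all import ARP, send

SPOOF_MAC = '02:02:02:02:02:02'  # MAC address to announce
BCAST_MAC = 'ff:ff:ff:ff:ff:ff'  # ARP target hardware address
SPOOFIPADDR = '192.0.2.222'      # IP address to announce

# send() works at layer 3; Scapy builds the Ethernet header itself.
send(ARP(op=2, psrc=SPOOFIPADDR, pdst=SPOOFIPADDR, hwsrc=SPOOF_MAC, hwdst=BCAST_MAC))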

And the resulting forged packet as seen in cppcap:
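A possible variation, offered as an untested sketch: as far as I understand Scapy's layer-3 send(), it builds the Ethernet header itself, so the frame's Ethernet source is the outgoing interface's MAC even when hwsrc is spoofed in the ARP payload. Switches learn from the Ethernet source, so for a vMAC announcement both fields should carry the vMAC. The sketch below uses only the Python standard library, assuming a Python 3 interpreter is available in Expert mode: it builds the frame by hand and sends it through a Linux AF_PACKET raw socket (root required). The broadcast target hardware address is an assumption, and the same warning about unsupported tinkering on Gaia applies.

```python
import socket
import struct

ETH_P_ARP = 0x0806
BROADCAST = b"\xff" * 6


def mac_to_bytes(mac: str) -> bytes:
    return bytes.fromhex(mac.replace(":", ""))


def build_garp_frame(mac: str, ip: str) -> bytes:
    """Broadcast G-ARP (ARP reply, sender IP == target IP) in which both the
    Ethernet source and the ARP sender hardware address are `mac`."""
    src = mac_to_bytes(mac)
    ip_bytes = socket.inet_aton(ip)
    ethernet = BROADCAST + src + struct.pack("!H", ETH_P_ARP)
    arp = (struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)  # Ethernet/IPv4, op=2 (reply)
           + src + ip_bytes                          # sender MAC / sender IP
           + BROADCAST + ip_bytes)                   # target MAC (assumed) / target IP
    return ethernet + arp


def send_frame(iface: str, frame: bytes) -> None:
    """Send one raw Ethernet frame on `iface` (Linux AF_PACKET, needs root).
    Frames shorter than 60 bytes are normally padded by the driver/NIC."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((iface, 0))
        sock.send(frame)
```

With the lab values from this thread, the call would be send_frame("eth4", build_garp_frame("00:1C:7F:00:33:61", "200.100.0.1")) on the member that currently owns the VIP, repeated for each cluster VIP.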

Vladimir · 2021-08-12

Thank you, Tim. I've seen references to Scapy before, but it is nice to actually see it working.

Theoretically, taking a snapshot of a pre-deployed cluster member, using Scapy, and then reverting to the snapshot should do the trick.

I doubt that banking and insurance clients would go for using a third-party tool, though.

It would be nice if Check Point provided a native solution for this issue, as it would make transitions from competing solutions less of a headache, not to mention simply enabling the vMAC feature.

Vladimir · 2021-08-12

Also, sk50840 suggests that there is a G-ARP sourced from the VMAC:

"With VMAC mode, the G-ARP packet is the only packet which is sourced from the Virtual MAC address. Upon failover, the connected switch learns the port location of the new active member using the MAC Learning process and the switch updates it's CAM table with the new port location. It learns the new location (switch port) of the VMAC, and directs the frames to the new location. In this way, the router does not need to do any work to result in the failover of traffic."

It would be nice to get some clarity from Check Point as to what happened to it.
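The mechanism sk50840 describes is ordinary transparent-bridge MAC learning: the switch records the Ethernet source address of every frame it receives, broadcasts included, against the ingress port. A toy Python model (not any vendor's implementation; the port numbers are hypothetical) of how a single vMAC-sourced G-ARP moves the vMAC to the new active member's port:

```python
from typing import Dict, Optional


class LearningSwitch:
    """Toy transparent bridge: learn source MAC -> ingress port from every frame."""

    def __init__(self) -> None:
        self.cam: Dict[str, int] = {}

    def receive(self, port: int, src_mac: str) -> None:
        # Learning looks only at the source address, so broadcasts count too.
        self.cam[src_mac.lower()] = port

    def port_for(self, mac: str) -> Optional[int]:
        return self.cam.get(mac.lower())


VMAC = "00:1C:7F:00:33:61"     # from the cphaprob -a if output
CPCM1_PORT, CPCM2_PORT = 1, 2  # hypothetical switch ports

switch = LearningSwitch()
switch.receive(CPCM1_PORT, VMAC)  # CPCM1 active: its vMAC-sourced G-ARP arrives on port 1
print(switch.port_for(VMAC))      # 1

switch.receive(CPCM2_PORT, VMAC)  # failover: CPCM2 sends a vMAC-sourced G-ARP
print(switch.port_for(VMAC))      # 2
```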

Alexander_Wilke · 2023-06-30

There is a new version of sk50840 that supports G-ARP with the VMAC as the source.

Unfortunately, Check Point missed that Cisco ACI and Cisco FabricPath environments use "conversational learning" mechanisms to learn MAC addresses and store them in their CAM tables. Only bidirectional conversations are stored in the CAM table.

Furthermore, broadcasts and multicasts are not used to store MACs in the CAM table. Check Point missed that, so G-ARP with the VMAC as the source will not work.

Unfortunately, they decided to do something different across their product lines: in ClusterXL, the VMAC is used only for G-ARP, while Maestro and 64k use a different implementation in which the VMAC is the source of all packets.

sk165674
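A toy Python model of the difference described above, heavily simplified and based only on this post's description (it is not Cisco's actual ACI or FabricPath algorithm): the switch learns a source MAC only from unicast frames addressed to a MAC it already knows, i.e. one side of a two-way conversation, and never from broadcast or multicast frames. All MAC addresses and ports other than the vMAC are hypothetical.

```python
from typing import Dict


def is_group_address(mac: str) -> bool:
    """Broadcast/multicast MACs have the I/G bit (LSB of the first octet) set."""
    return (int(mac.split(":")[0], 16) & 1) == 1


class ConversationalSwitch:
    """Learns src -> port only from unicast frames sent to an already-known MAC."""

    def __init__(self, known: Dict[str, int]) -> None:
        self.cam: Dict[str, int] = dict(known)

    def receive(self, port: int, src: str, dst: str) -> None:
        if not is_group_address(dst) and dst in self.cam:
            self.cam[src] = port


VMAC = "00:1c:7f:00:33:61"        # vMAC from the cphaprob -a if output
PHYS_MAC = "02:00:00:00:00:02"    # hypothetical physical MAC of the active member
ROUTER_MAC = "02:00:00:00:00:fe"  # hypothetical upstream router, already learned
BCAST = "ff:ff:ff:ff:ff:ff"

# ClusterXL, per the post: VMAC only in the G-ARP, data frames use the physical MAC.
garp_only = ConversationalSwitch({ROUTER_MAC: 10})
garp_only.receive(2, VMAC, BCAST)
garp_only.receive(2, PHYS_MAC, ROUTER_MAC)
print(VMAC in garp_only.cam)  # False

# Maestro / 64k, per the post: VMAC is the source of all frames.
vmac_everywhere = ConversationalSwitch({ROUTER_MAC: 10})
vmac_everywhere.receive(2, VMAC, BCAST)
vmac_everywhere.receive(2, VMAC, ROUTER_MAC)
print(VMAC in vmac_everywhere.cam)  # True
```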
