Solved: Re: Subinterface down, results in failover

Justin_Hickey · ‎2017-09-28

Having firewall failovers about once a day. The logs show sub interfaces within the DMZ physical interfaces going down. It's not always the same interface that is reported as down. I'm unsure how a virtual interface can report as physically down or missing. So far I've seen no issues in the limited poking around I've done on the Cisco switch side. Any help is appreciated.


Marco_Valenti · ‎2017-09-28

How many VLANs are configured on that interface? Is it a single interface or a bond? Are the gateways on R77.30, with any Jumbo Hotfix installed?

Are you on an HA cluster with CCP in multicast or broadcast mode? Quite often IGMP snooping is enabled on the switch side, with the result that multicast packets get dropped.

Justin_Hickey · ‎2017-09-28

There are 2 independent DMZ trunks, each with 5 subinterfaces. Gateway is R80.10 with Take 35

Not sure where to check the HA cluster mode. Could you point me in the right direction?

Marco_Valenti · ‎2017-09-28

Run cphaprob stat over SSH; the Cluster Mode line gives you the information you need. By the way, I would point my finger at the layer 2 configuration of both switches and check whether IGMP snooping is on.

cphaprob stat

Cluster Mode: High Availability (Primary Up) with IGMP Membership

Number Unique Address Assigned Load State

1 (local) 100% Active
2 0% Standby

PhoneBoy · ‎2017-09-28

FYI, if it's the lowest VLAN that "goes down" then the whole interface will report as down.

That's by design.

Hugo_vd_Kooij · ‎2017-09-29

Shouldn't the default settings check both lowest and highest VLAN?

<< We make miracles happen while you wait. The impossible jobs take just a wee bit longer. >>

Norbert_Bohusch · ‎2017-10-02

Yes, it's both lowest and highest. I don't know exactly since when that has been the case, but I assume R76.

Justin_Hickey · ‎2017-10-02

R80.10 Take 35

Justin_Hickey · ‎2017-09-29

Thanks for all the responses. IGMP snooping is indeed on. I'm asking the network team to disable snooping on the DMZ switch as a whole. They have concerns it might impact something else in the switches. I honestly don't know what benefit, if any, IGMP snooping might have in standalone DMZ switches.

Marco_Valenti · ‎2017-09-29

Well, if you use multicast in any way it could matter, but I really don't think so: any application should register with its multicast group, so that won't be an issue. You should also know that you can switch CCP packets to broadcast mode instead of multicast mode, but this will increase traffic on your switch a LOT, really a lot, if you have tons of interfaces configured in ClusterXL.

Norbert_Bohusch · ‎2017-10-02

The issue could be the following:

If the sub-interface that is seen as down in the logs is either the lowest or the highest VLAN of the trunk, and there is no traffic on that subnet besides the two firewall nodes, then a policy install on this cluster could lead to such behavior.

Normally the firewall nodes reply to each other and so they see the interface as up, but during policy commit the nodes reply too slowly, and as there is really no other traffic seen on the interface at that moment, the gateway declares the interface down!

To mitigate that, interfaces without nodes behind them should be changed to non-monitored ones!

Justin_Hickey · ‎2017-10-02

I mean, we do have DMZ interfaces (wired guest subnets) which could have no traffic on them for extended periods of time. No idea how I would mitigate this, and they need to be up all of the time even if there is no traffic.

Timothy_Hall · ‎2017-10-02

The ClusterXL "Interface Active Check" can fail and an interface declared down even if the firewalls can see each other's CCP traffic consistently across an interface; this situation can occur if there is not at least one other pingable host located on that interface. When ClusterXL notices that only the firewalls and their associated CCP traffic are present on an interface (because there are no ARP entries present for any other hosts on that interface), the cluster members will begin probing that interface's VLAN with ping scans trying to locate at least one responding host. The first time I saw this ping scan behavior was quite unsettling as I thought there was some kind of compromise in the network. This ClusterXL probing behavior is mentioned in item #3 here:

sk43984: Interface flapping when cluster interfaces are connected through several switches

What ClusterXL is looking for in this instance is a VLAN misconfiguration issue, where both firewalls are on the same VLAN and can see each other, but they are on the wrong VLAN to provide service to the hosts on that subnet. After all, why did you create this clustered interface if there are no hosts present there?

Justin mentioned that the problematic interface is some kind of guest subnet, so it is plausible that during certain periods there are no pingable hosts present and the interface will be declared down until a host shows up. The best solution here is to make sure there is a pingable host located on that interface at all times, by adding a switch or wireless access point management IP address that will always be present and responding.

There are some other ways to deal with this by modifying ClusterXL kernel variables and such, but the above solution is the easiest to implement and will help ClusterXL to more accurately detect real network failures.
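To make the "at least one other host" condition concrete, here is a minimal sketch that checks saved `arp -an` output for resolved entries on one VLAN interface, ignoring the firewalls' own addresses. The function name `other_hosts_on_if`, the interface name and the IP addresses are all made up for illustration; this only approximates the idea described above and is not how ClusterXL itself performs the check.

```shell
# Hypothetical helper: list resolved ARP entries on one interface that are not
# in the exclusion list (cluster member IPs, cluster VIP).
# Reads Linux `arp -an` style output on stdin, for example:
#   ? (10.1.1.50) at 00:11:22:33:44:55 [ether] on eth1.100
# Usage: arp -an | other_hosts_on_if IFACE [IP_TO_IGNORE ...]
# Exit status is 0 if at least one other host was found, 1 otherwise.
other_hosts_on_if() {
    iface=$1
    shift
    awk -v iface="$iface" -v skip=" $* " '
        # Last field is the interface; field 4 is the MAC or "<incomplete>"
        $NF == iface && $4 != "<incomplete>" {
            ip = $2
            gsub(/[()]/, "", ip)
            if (index(skip, " " ip " ") == 0) { print ip; found = 1 }
        }
        END { exit !found }'
}
```

Running something like `arp -an | other_hosts_on_if eth1.100 10.1.1.2 10.1.1.3` on a member would then print nothing and return non-zero when only the firewalls are known on that VLAN, which is the situation in which the ping probing described above would start.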

--
My book "Max Power: Check Point Firewall Performance Optimization"
now available via http://maxpowerfirewalls.com.

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

Justin_Hickey · ‎2017-10-02

Thanks Tim, Just lit up 2 vlan interfaces on each and every subnet. Should know soon if that's the fix. Thanks.

Hugo_vd_Kooij · ‎2017-10-02

Justin,

My first suggestion would be to switch to broadcast mode first. If the issue is gone, you know you have a multicast issue to discuss with your switch expert(s).

But I have seen issues where this could not be resolved and the firewalls are in broadcast mode as a permanent solution.

Stretched VLANs is one of those buzzwords that causes me to grab a bottle of painkillers.


Justin_Hickey · ‎2017-10-02

Thanks Hugo, Going to hold off to see if creating ping'able interfaces on all vlans is the solution. Then, the plan is to try this. Thanks for the reply.

Timothy_Hall · ‎2017-10-02

Yep, stretched VLANs and ClusterXL don't tend to work together well unless network conditions are perfect, mainly because ClusterXL assumes that cluster networks meet the minimum requirements for latency (less than ~30 ms) and packet loss (less than ~2-3%). Numbers going higher than these, even briefly, will cause all kinds of undesired ClusterXL behavior.
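As a rough way to compare a link against those numbers, the sketch below reads the summary that Linux `ping` prints and flags it when average RTT or loss is at or above a limit. The function name `ccp_ping_check` is hypothetical, and the defaults of 30 ms and 2% are simply the approximate figures quoted above turned into hard limits, which is an assumption rather than a documented threshold.

```shell
# Hypothetical helper: judge a Linux ping summary against rough ClusterXL limits.
# Reads ping output on stdin.
# Usage: ping -c 20 PEER | ccp_ping_check [MAX_AVG_RTT_MS] [MAX_LOSS_PCT]
# Exit status: 0 within limits, 1 out of range, 2 no summary found.
ccp_ping_check() {
    awk -v max_rtt="${1:-30}" -v max_loss="${2:-2}" '
        /packet loss/ {
            # e.g. "10 packets transmitted, 10 received, 0% packet loss, time 9012ms"
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { loss = $i; sub(/%/, "", loss) }
        }
        /^(rtt|round-trip) / {
            # e.g. "rtt min/avg/max/mdev = 0.321/0.402/0.611/0.080 ms"
            split($4, t, "/")
            avg = t[2]
        }
        END {
            if (loss == "") { print "no ping summary found"; exit 2 }
            if (avg == "") {
                # No RTT line at all, which is what 100% loss looks like
                printf "avg_rtt_ms=n/a loss_pct=%s OUT_OF_RANGE\n", loss
                exit 1
            }
            ok = (avg + 0 < max_rtt + 0) && (loss + 0 < max_loss + 0)
            printf "avg_rtt_ms=%s loss_pct=%s %s\n", avg, loss, (ok ? "OK" : "OUT_OF_RANGE")
            exit !ok
        }'
}
```

A single ping run only samples a short window, so brief spikes of the kind mentioned above can easily be missed; repeated runs over time give a better picture.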


Justin_Hickey · ‎2017-10-03

The active IPs scenario didn't fix the problem so I am back to the option of disabling multicast to see if that is the issue. I did this by issuing the below command:

[Expert@HostName]# cphaconf set_ccp broadcast

I haven't noticed any performance difference. I'd like to see this broadcast traffic to judge for myself how much of it there is. I did an fw monitor on one of the real addresses of the firewall and am not seeing any broadcast traffic.

fw monitor -e "accept host(xxx.xx.206.2);"

Curious if anyone can help me craft a statement that will show the broadcasts.

Norbert_Bohusch · ‎2017-10-03

The CCP-mode is a layer 2 change!

This means the destination MAC addresses are broadcast or multicast addresses. Layer 3 remains untouched, and the destination IP should in both cases be the network address.

This is from my lab in multicast mode:

# tcpdump -enni eth2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth2, link-type EN10MB (Ethernet), capture size 96 bytes

14:02:07.773569 00:00:00:00:01:01 > 01:00:5e:28:e5:fa, ethertype IPv4 (0x0800), length 76: 0.0.0.0.8116 > 192.168.229.0.8116: UDP, length 34

and this in broadcast mode:

# tcpdump -enni eth2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth2, link-type EN10MB (Ethernet), capture size 96 bytes
14:02:21.317617 00:00:00:00:01:00 > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 82: 0.0.0.0.8116 > 192.168.229.0.8116: UDP, length 40
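The difference between the two captures above is only in the destination MAC, so it can be read off mechanically. A small sketch, assuming `tcpdump` was run with `-e` so that the link-level header is printed; the function name `ccp_l2_mode` is made up for illustration:

```shell
# Hypothetical helper: classify frames from `tcpdump -enni <if> port 8116`
# output by destination MAC. Reads tcpdump lines on stdin and prints
# "broadcast", "multicast" or "unicast" for each packet line.
ccp_l2_mode() {
    awk '
        # With -e the line starts: <time> <src-mac> > <dst-mac>, ethertype ...
        # Banner lines do not have ">" as the third field and are skipped.
        $3 == ">" {
            dst = tolower($4)
            sub(/,$/, "", dst)
            if (dst == "ff:ff:ff:ff:ff:ff") {
                print "broadcast"
            } else if (substr(dst, 2, 1) ~ /[13579bdf]/) {
                # Odd first octet means the Ethernet group (multicast) bit is set
                print "multicast"
            } else {
                print "unicast"
            }
        }'
}
```

For example, `tcpdump -enni eth2 -c 50 port 8116 2>/dev/null | ccp_l2_mode | sort | uniq -c` would summarize which mode the members are actually sending in.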

Albin_Hakansson · ‎2017-10-03

Since CCP works on port 8116, you could try tcpdump -nei ethX port 8116
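To narrow that capture to broadcast frames only, the standard pcap filter primitives `ether broadcast` and `ether multicast` can be combined with the port match. A minimal sketch; the wrapper name `ccp_capture_filter` is hypothetical and only builds the filter string:

```shell
# Hypothetical helper: build a pcap filter for CCP (UDP 8116) in one L2 mode.
# Usage: tcpdump -nei ethX "$(ccp_capture_filter broadcast)"
ccp_capture_filter() {
    case "$1" in
        broadcast)
            echo 'ether broadcast and udp port 8116'
            ;;
        multicast)
            # "ether multicast" also matches the broadcast address, so exclude it
            echo 'ether multicast and not ether broadcast and udp port 8116'
            ;;
        *)
            echo 'usage: ccp_capture_filter broadcast|multicast' >&2
            return 1
            ;;
    esac
}
```

An fw monitor expression on one of the real IP addresses will not show this, since, as noted above, the mode only changes the destination MAC.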

Justin_Hickey · ‎2017-10-04

After switching to broadcast mode the failovers have stopped. I ran the command tcpdump -nei eth1-04 port 8116 and I see that I am now processing about 20 broadcasts per second. I'd like the network team to disable IGMP snooping because I don't think it has any real value in DMZ switches.

Many thanks to everyone who responded with suggestions. This is an amazing support group. Hope I can return the favor someday.
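For anyone wanting to reproduce a per-second figure like the one above rather than eyeballing the scrolling capture, here is a small sketch that buckets tcpdump output by the second in its timestamp. The function name `ccp_rate` is made up, and it assumes tcpdump's default `HH:MM:SS.ffffff` timestamps:

```shell
# Hypothetical helper: count packets per second in tcpdump output.
# Reads tcpdump lines on stdin and prints "HH:MM:SS count" for each second seen.
# Assumes default tcpdump timestamps and chronological input.
ccp_rate() {
    awk '
        $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\./ {
            sec = substr($1, 1, 8)
            if (sec != cur) {
                # Second changed: emit the finished bucket and start a new one
                if (cur != "") print cur, n
                cur = sec
                n = 0
            }
            n++
        }
        END { if (cur != "") print cur, n }'
}
```

Something like `tcpdump -nei eth1-04 -c 200 port 8116 2>/dev/null | ccp_rate` would print one line per second; the first and last lines cover partial seconds and are best ignored.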
