biskit
Advisor

Thoughts on a random cluster problem?

Hi all,

Every now and then a customer (the same customer) emails me to say "the firewall has gone down again and killed our replication jobs".  After several weeks with no problem, this happened again twice yesterday.  I found logs in both SmartLog and /var/log/messages which match the times of the connectivity drop.  Interestingly it only seems to moan about VLAN 52, so the physical eth3 interface and the other VLANs on that interface appear to be OK.  One thing to note is that the cluster members are at different sites, so my initial thought is some kind of networking issue, possibly latency if the leased line is being saturated?  I've asked the people who support the network to look into this.  Does anyone else have any different thoughts on what could be causing VLAN 52 to lose comms between the cluster members?

Thanks,

Matt

 

Sep 18 16:28:01 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-110305-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface eth2 is down (Cluster Control Protocol packets are not received)

Sep 18 16:28:02 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-110305-1: State remains: ACTIVE! | Reason: Interface eth3.52 is down (Cluster Control Protocol packets are not received)

Sep 18 16:28:02 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-210300-1: Remote member 2 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)

Sep 18 16:28:02 2019 xxxxxxxx-fwa kernel: [fw4_1];fwldbcast_handle_retrans_request: Updated bchosts_mask to 1

Sep 18 16:28:02 2019 xxxxxxxx-fwa kernel: [fw4_0];fwldbcast_handle_retrans_request: Updated bchosts_mask to 1

Sep 18 16:28:02 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-214802-1: Remote member 2 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster

Sep 18 16:28:02 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-100102-1: Failover member 1 -> member 2 | Reason: Interface eth3.52 is down (Cluster Control Protocol packets are not received)

Sep 18 16:28:22 2019 xxxxxxxx-fwa kernel: [fw4_1];check_other_machine_activity: Update state of member id 1 to DEAD, didn't hear from it since 930450.9 and now 930453.9

Sep 18 16:28:22 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-216400-1: Remote member 2 (state STANDBY -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD

Sep 18 16:28:48 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-210300-1: Remote member 2 (state LOST -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)

Sep 18 16:28:48 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-114904-1: State change: ACTIVE(!) ->  ACTIVE | Reason: Reason for ACTIVE! alert has been resolved

Sep 18 16:28:48 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-214802-1: Remote member 2 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster

 

Sep 18 16:43:30 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-210300-1: Remote member 2 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)

Sep 18 16:43:30 2019 xxxxxxxx-fwa kernel: [fw4_1];fwldbcast_handle_retrans_request: Updated bchosts_mask to 1

Sep 18 16:43:30 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-110305-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface eth3.52 is down (Cluster Control Protocol packets are not received)

Sep 18 16:43:31 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-214802-1: Remote member 2 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster

Sep 18 16:43:31 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-100102-1: Failover member 1 -> member 2 | Reason: Interface eth3.52 is down (Cluster Control Protocol packets are not received)

Sep 18 16:43:52 2019 xxxxxxxx-fwa kernel: [fw4_1];check_other_machine_activity: Update state of member id 1 to DEAD, didn't hear from it since 931378.3 and now 931381.3

Sep 18 16:43:52 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-216400-1: Remote member 2 (state STANDBY -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD

Sep 18 16:45:25 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-114904-1: State change: ACTIVE(!) ->  ACTIVE | Reason: Reason for ACTIVE! alert has been resolved

Sep 18 16:45:25 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-214802-1: Remote member 2 (state LOST -> STANDBY) | Reason: There is already an ACTIVE member in the cluster

3 Replies
FedericoMeiners
Advisor

@biskit I had a similar case in a very big bank. The switches and routers involved were all OK, but further investigation on the firewall showed a lot of RX/TX errors on the sync interfaces, which caused repeated loss of CCP packets.

Do you see any errors on those interfaces? If so, you will need to set up a bond for sync.
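A rough sketch of where I'd look for those counters, assuming a Gaia / Linux-based gateway (the interface names below are placeholders, substitute your real sync interface and eth3):

netstat -ni            # per-interface RX-ERR / RX-DRP / TX-ERR / TX-DRP counters
ifconfig <sync-if>     # errors / dropped / overruns on the sync interface (placeholder name)
ethtool -S eth3        # NIC driver statistics, handy for spotting ring buffer drops

If those counters climb at the same time the CLUS messages appear, that points at the physical/driver layer rather than anything ClusterXL itself is doing.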

Hope it helps,

 

____________
https://www.linkedin.com/in/federicomeiners/
Duane_Toler
Advisor

When dealing with anything geographically-dispersed (across the room or across town), my first thoughts are always "layer 2".

 

Make sure spanning tree between points A and B is sane, that you don't have blocked ports on the preferred path, and that you know your root bridge (if not, set one intentionally: spanning-tree vlan 52 priority 4096 at the root of your LAN, or on your distribution-layer switch; and set a secondary root too: spanning-tree vlan 52 priority 8192).
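A quick way to sanity-check that, assuming Cisco IOS syntax on the switches (adjust for your vendor):

show spanning-tree vlan 52    ! root bridge ID, local priority, and per-port roles/states for VLAN 52

Run it on the switches at both sites; the root bridge ID should be the same everywhere VLAN 52 is carried, and the ports on the path between the cluster members should all be forwarding.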

 

Prune your VLANs off trunk ports, and make sure VLAN 52 is spanning where you need it, and only there.  This keeps CAM tables trimmed as well as keeping the tree tuned.  Make sure spanning tree is taking the optimal path if you're crossing multiple switches to the other peer.  If there are intervening switches, check their stats to make sure their CPUs aren't spiking during high-traffic times; spanning tree and BPDU processing are CPU-bound control-plane processes.  "cphaprob syncstat" will tell you if you're losing sync packets when the customer calls you.  If you can correlate that with switch stats, that would be helpful.  On the switches, again, trim the BPDU processing with VLAN trunk pruning.
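The pruning itself would look something like this, again assuming Cisco IOS (the interface name and VLAN list are placeholders for whatever that trunk actually needs to carry):

show interfaces trunk                       ! which VLANs are currently allowed and active on each trunk
interface GigabitEthernet1/0/1
 switchport trunk allowed vlan 10,20,52     ! explicitly allow only the VLANs this trunk needs

Then watch "cphaprob syncstat" on the members while the network team makes changes, so you can tell whether the lost-packet counters stop climbing.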

 

I had an issue with a customer long ago who didn't have their trunk ports pruned, and they'd call up with a similar "outage" whenever they happened to fire up a multicast-based SAN-to-SAN replication session... Sigh... APs on trunk ports did not appreciate having to ignore the multicast maelstrom they received.  🙂

 

Anyway.. hope something here gets you pointed in the right direction.

Timothy_Hall
Champion

A few things:

1) VLAN 52 must be the lowest tag number on eth3; as such, ClusterXL will only send and receive CCP packets on eth3.52, so that will be the only VLAN it complains about when a problem happens with that interface (see the commands after this list to confirm which interfaces are being monitored).

2) Ensure you do not have an IP address on the physical untagged interface eth3 to process untagged traffic.  While this configuration will work on a non-clustered firewall, it is most definitely not supported and may lead to some strange behavior.

3) The error messages would seem to indicate mainly a problem with eth3, not necessarily the sync interface.  Still worth checking out the health of your sync network though (again, see the commands below).

4) What mode is CCP set to use?  If available in your version, I'd strongly suggest setting unicast; this eliminates the possibility of a switch mishandling or suppressing broadcast and multicast CCP traffic in a 2-node cluster.
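A rough sketch of the gateway-side checks for points 1, 3 and 4 (exact output varies by version, so treat the field details as assumptions and check the ClusterXL admin guide for your release):

cphaprob state      # overall member states, should mirror the CLUS messages in /var/log/messages
cphaprob -a if      # which interfaces/VLANs ClusterXL monitors for CCP, and on recent versions the CCP mode
cphaprob syncstat   # sync delivery and retransmit statistics
fw ctl pstat        # includes a Sync section with lost/retransmitted update counters

Changing the CCP mode is version-dependent (cphaconf set_ccp on older releases, for example), so verify the supported modes and exact syntax for your version before changing anything.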

Other than those things, I gotta agree with @Duane_Toler here: check out Layer 2 for issues.  Make sure all switchports attached to the firewall are set to portfast, among other things.
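On a Cisco switch that would look roughly like this (the interface name is a placeholder, and the trunk variant only applies if the firewall-facing port is actually a trunk):

interface GigabitEthernet1/0/10
 spanning-tree portfast trunk     ! start forwarding immediately instead of waiting out listening/learning

That stops a spanning-tree reconvergence from holding the firewall-facing port out of forwarding (up to 30 seconds with legacy STP) every time something flaps elsewhere in the layer 2 domain.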

 

 

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com