Hi all,
Every now and then a customer (always the same one) emails me to say "the firewall has gone down again and killed our replication jobs". After several weeks with no problems, it happened twice yesterday. I found entries in both SmartLog and /var/log/messages that match the times of the connectivity drops. Interestingly, it only seems to complain about VLAN 52 (eth3.52), so the physical eth3 interface and the other VLANs on that interface appear to be OK. One thing to note is that the cluster members are at different sites, so my initial thought is some kind of network issue, possibly latency if the leased line is being saturated. I've asked the people who support the network to look into this. Does anyone have other thoughts on what could be causing the cluster members to lose Cluster Control Protocol comms on VLAN 52?
Thanks,
Matt
Sep 18 16:28:01 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-110305-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface eth2 is down (Cluster Control Protocol packets are not received)
Sep 18 16:28:02 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-110305-1: State remains: ACTIVE! | Reason: Interface eth3.52 is down (Cluster Control Protocol packets are not received)
Sep 18 16:28:02 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-210300-1: Remote member 2 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
Sep 18 16:28:02 2019 xxxxxxxx-fwa kernel: [fw4_1];fwldbcast_handle_retrans_request: Updated bchosts_mask to 1
Sep 18 16:28:02 2019 xxxxxxxx-fwa kernel: [fw4_0];fwldbcast_handle_retrans_request: Updated bchosts_mask to 1
Sep 18 16:28:02 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-214802-1: Remote member 2 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
Sep 18 16:28:02 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-100102-1: Failover member 1 -> member 2 | Reason: Interface eth3.52 is down (Cluster Control Protocol packets are not received)
Sep 18 16:28:22 2019 xxxxxxxx-fwa kernel: [fw4_1];check_other_machine_activity: Update state of member id 1 to DEAD, didn't hear from it since 930450.9 and now 930453.9
Sep 18 16:28:22 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-216400-1: Remote member 2 (state STANDBY -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD
Sep 18 16:28:48 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-210300-1: Remote member 2 (state LOST -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
Sep 18 16:28:48 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
Sep 18 16:28:48 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-214802-1: Remote member 2 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
Sep 18 16:43:30 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-210300-1: Remote member 2 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
Sep 18 16:43:30 2019 xxxxxxxx-fwa kernel: [fw4_1];fwldbcast_handle_retrans_request: Updated bchosts_mask to 1
Sep 18 16:43:30 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-110305-1: State change: ACTIVE -> ACTIVE(!) | Reason: Interface eth3.52 is down (Cluster Control Protocol packets are not received)
Sep 18 16:43:31 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-214802-1: Remote member 2 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
Sep 18 16:43:31 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-100102-1: Failover member 1 -> member 2 | Reason: Interface eth3.52 is down (Cluster Control Protocol packets are not received)
Sep 18 16:43:52 2019 xxxxxxxx-fwa kernel: [fw4_1];check_other_machine_activity: Update state of member id 1 to DEAD, didn't hear from it since 931378.3 and now 931381.3
Sep 18 16:43:52 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-216400-1: Remote member 2 (state STANDBY -> LOST) | Reason: Timeout Control Protocol packet expired member declared as DEAD
Sep 18 16:45:25 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-114904-1: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
Sep 18 16:45:25 2019 xxxxxxxx-fwa kernel: [fw4_1];CLUS-214802-1: Remote member 2 (state LOST -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
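
To check whether it really is only one interface being reported across a longer stretch of /var/log/messages, something like the following could tally the "Interface ... is down" reasons per interface. This is only a rough sketch: the regular expression is written against the line format in the excerpt above, and the function names (`interface_down_events`, `count_by_interface`) are made up for illustration, not part of any Check Point tooling. Lines that say "Interface is down" without naming an interface are skipped.

```python
import re
from collections import Counter

# Assumption: every line of interest looks like the CLUS-* lines in the excerpt,
# i.e. "<Mon> <day> <hh:mm:ss> <year> <host> kernel: ...CLUS-nnnnnn-n: ... Interface <name> is down".
IFACE_DOWN = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2} \d{4}) .*?"
    r"(?P<code>CLUS-\d+-\d+): .*Interface (?P<iface>\S+) is down"
)


def interface_down_events(lines):
    """Yield (timestamp, message code, interface) for each line naming a down interface."""
    for line in lines:
        match = IFACE_DOWN.search(line)
        if match:
            yield match.group("ts"), match.group("code"), match.group("iface")


def count_by_interface(lines):
    """Count how many interface-down messages mention each interface."""
    return Counter(iface for _, _, iface in interface_down_events(lines))


if __name__ == "__main__":
    # Path taken from the post; adjust if the logs have been rotated or exported.
    with open("/var/log/messages", errors="replace") as handle:
        for iface, hits in count_by_interface(handle).most_common():
            print(f"{iface}: {hits}")
```

Comparing the timestamps it yields against the leased-line utilisation graphs would be one way to test the saturation theory. It may also be worth a second look at the first line of the excerpt, which names eth2 rather than eth3.52.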