Hi All,
I posted about a year ago regarding unscheduled failovers between a pair of Check Point SMB 1590s running R80.20.35. Having opened a case with Check Point support, we tried changing the DNS probes to use IP addresses when testing for internet connectivity, which vastly reduced the number of failovers. Within the past few weeks, failovers have started to occur more regularly again. I am speaking with the customer regarding their internet connectivity, as initial investigation shows that both nodes can display the following before a failover:
2023 Mar 1 16:44:33 Gatekeeper2 user.info cposd: [CPOSD] WAN connection "Internet1": Internet connection probe status has changed to Disconnected. servers: 3, fails: 10, attempts: 30
2023 Mar 1 16:44:33 Gatekeeper2 daemon.info dnsmasq: reading /etc/resolv.conf
2023 Mar 1 16:44:33 Gatekeeper2 daemon.info dnsmasq: using nameserver 8.8.8.8#53
2023 Mar 1 16:44:33 Gatekeeper2 daemon.info dnsmasq: using nameserver 8.8.4.4#53
2023 Mar 1 16:44:33 Gatekeeper2 daemon.info dnsmasq: read /var/hosts - 31 addresses
2023 Mar 1 16:44:42 Gatekeeper2 daemon.info dnsmasq: reading /etc/resolv.conf
2023 Mar 1 16:44:42 Gatekeeper2 daemon.info dnsmasq: using nameserver 8.8.8.8#53
2023 Mar 1 16:44:42 Gatekeeper2 daemon.info dnsmasq: using nameserver 8.8.4.4#53
2023 Mar 1 16:44:42 Gatekeeper2 daemon.info dnsmasq: read /var/hosts - 31 addresses
2023 Mar 1 16:44:44 Gatekeeper2 daemon.info dnsmasq: reading /etc/resolv.conf
2023 Mar 1 16:44:44 Gatekeeper2 daemon.info dnsmasq: using nameserver 8.8.8.8#53
2023 Mar 1 16:44:44 Gatekeeper2 daemon.info dnsmasq: using nameserver 8.8.4.4#53
2023 Mar 1 16:44:44 Gatekeeper2 daemon.info dnsmasq: read /var/hosts - 31 addresses
2023 Mar 1 16:44:45 Gatekeeper2 user.info cposd: [CPOSD] WAN connection "Internet1": Internet connection probe status has changed to Connected. servers: 3, fails: 9, attempts: 30
2023 Mar 1 16:44:46 Gatekeeper2 user.info lua: [Security Settings] A policy change has been applied
2023 Mar 1 16:44:46 Gatekeeper2 user.info lua: [Security Settings] High Availability policy change has been applied
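In case it helps anyone watching for the same pattern, the flap can be caught live by tailing the syslog on both members. A rough sketch of what I run (I'm assuming syslog writes to /var/log/messages on the Embedded image; adjust the path if yours differs):

tail -f /var/log/messages | grep -Ei 'cposd|dnsmasq'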
Upon further investigation with the software vendor these were purchased from, we saw the following in dmesg:
[29837928.418739] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29838238.763245] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29838389.930050] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29838758.413897] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29838807.134983] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29838866.139433] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29839793.944978] mvpp2 f2000000.ethernet eth0: bad rx status 13008514 (crc error), size=66
[29839895.013737] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29839925.978426] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29841751.537571] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29841907.975809] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29842040.630989] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29843430.323976] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
Relative to the total number of packets, the number of CRC errors is low:
eth0 Link encap:Ethernet HWaddr 00:1C:7F:AE:0A:A2
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:106054832559 errors:11201 dropped:0 overruns:0 frame:0
TX packets:105562308040 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:2048
RX bytes:98857319801669 (89.9 TiB) TX bytes:97218218207773 (88.4 TiB)
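For context, that works out to roughly one errored frame per 9.5 million received. A quick back-of-the-envelope check (plain awk, nothing Check Point specific):

awk 'BEGIN { printf "%.6f%%\n", 11201 / 106054832559 * 100 }'
# prints 0.000011% - about 1 bad frame per ~9.5 million received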
Now, I read in a previous SK that the eth0 MAC address is the CPU interface to which the LAN1 to LAN8 ports are connected, hence all interfaces have the same MAC (default behaviour), so I cannot easily identify whether the issue is caused by a LAN cable, a switch port, etc. Additionally, none of the LAN1-8 interfaces are showing RX errors. I also read about possible *cosmetic errors* on virtual interfaces in another SK.
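For completeness, this is the quick sweep I use to confirm the per-port counters are clean (interface names LAN1 to LAN8 as they appear on these units; whether ethtool ships on the Embedded image is an assumption on my part, so the second command may not be available):

for i in $(seq 1 8); do echo "LAN$i:"; ifconfig LAN$i | grep errors; done
ethtool -S eth0 | grep -i err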
In the logs I also see:
Event Code: CLUS-114704
State change: STANDBY -> ACTIVE
Reason for state change: No other ACTIVE members have been found in the cluster
This is from cphaprob state. Is it stating that one cluster interface is not seeing traffic from its peer, or perhaps just the sync interface?
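When the next failover happens I'll try to capture the cluster's own interface view rather than inferring from the event code alone. A sketch of what I plan to run, assuming the standard cphaprob syntax behaves the same on Embedded as it does on full Gaia:

cphaprob state     # overall member states
cphaprob -a if     # monitored interfaces and their up/down status
cphaprob syncstat  # sync transport statistics, to rule the sync link in or out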
Both units have been up for well over 300 days, but we have an outage window to patch (possibly up to R81) and reboot within the next two weeks, at which point I can check any cabling. I don't know if there are any specific types of LAN cables we should be using. The issue has only been seen on one of the two units.
I will check the switch statistics shortly too.
I was wondering if anyone had any views or experiences with this?
Thanks and Regards
Dek