We recently had a client test KVM / oVirt as an alternative virtualization environment. During testing we noticed that ClusterXL was repeatedly failing, or simply not forming a cluster at all: the active member could not detect the standby member's status.
ClusterXL debug output:
;28Aug2018 14:50:31.269123;[cpu_6];[fw4_1];FW-1: check_other_machine_activity: calling fwldbcast_died for ID 0;
;28Aug2018 14:50:31.269138;[cpu_6];[fw4_1];FW-1: fwha_notify_interface: IF_IP_BY_HANDLE(ffff81023de270c0, 1)=10.121.47.131;
;28Aug2018 14:50:31.269145;[cpu_6];[fw4_1];FW-1: fwha_notify_interface: IF_IP_BY_HANDLE(ffff81023d856440, 2)=10.121.34.131;
;28Aug2018 14:50:31.269151;[cpu_6];[fw4_1];FW-1: fwha_notify_interface: IF_IP_BY_HANDLE(ffff81023d54fc40, 3)=10.121.36.131;
;28Aug2018 14:50:31.369011;[cpu_6];[fw4_1];FW-1: check_other_machine_activity: calling fwldbcast_died for ID 0;
;28Aug2018 14:50:31.469934;[cpu_6];[fw4_1];FW-1: check_other_machine_activity: calling fwldbcast_died for ID 0;
;28Aug2018 14:50:31.570847;[cpu_6];[fw4_1];FW-1: check_other_machine_activity: calling fwldbcast_died for ID 0;
;28Aug2018 14:50:31.670768;[cpu_6];[fw4_1];FW-1: fwha_report_id_problem_status: State (DOWN) reported by device Interface Active Check (non-blocking) (ID 1 time 85773.7);
;28Aug2018 14:50:31.670773;[cpu_6];[fw4_1];FW-1: id_blocking_state: check (0) (1) (4) ;
;28Aug2018 14:50:31.670774;[cpu_6];[fw4_1];FW-1: id_blocking_state: check (1) (1) (4) ;
;28Aug2018 14:50:31.670775;[cpu_6];[fw4_1];FW-1: id_blocking_state: check (2) (1) (4) ;
;28Aug2018 14:50:31.670776;[cpu_6];[fw4_1];FW-1: id_blocking_state: check (3) (1) (4) ;
;28Aug2018 14:50:31.670778;[cpu_6];[fw4_1];FW-1: id_blocking_state: check (4) (1) (4) ;
;28Aug2018 14:50:31.670779;[cpu_6];[fw4_1];FW-1: fwha_report_id_problem_status: Blocking state (ACTIVE) not changed by state DOWN from Interface Active Check (ID 1);
;28Aug2018 14:50:31.670788;[cpu_6];[fw4_1];FW-1: check_other_machine_activity: calling fwldbcast_died for ID 0;
;28Aug2018 14:50:31.770675;[cpu_6];[fw4_1];FW-1: check_other_machine_activity: calling fwldbcast_died for ID 0;
;28Aug2018 14:50:31.870573;[cpu_6];[fw4_1];FW-1: check_other_machine_activity: calling fwldbcast_died for ID 0;
;28Aug2018 14:50:31.970513;[cpu_6];[fw4_1];FW-1: check_other_machine_activity: calling fwldbcast_died for ID 0;
From a testing point of view we tried a number of things, including moving the two members onto the same physical host. Nothing resolved the inconsistency. We finally turned to the switching environment and noticed that the MAC addresses we were trying to communicate with were either missing from the switch's MAC address table or pointing to the wrong ports.
Testing led us to look at the anti-spoofing capabilities of oVirt. oVirt, and for that matter most hypervisors, KVM-based or otherwise, enable an anti-MAC-spoofing rule to prevent one VM from taking over another VM's traffic. In our case, with clustering, that takeover is exactly what we wanted to happen.
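Under oVirt the rule is applied as a libvirt network filter attached to each vNIC, so you can inspect it directly on the hypervisor. Below is a minimal sketch using the libvirt Python bindings that prints which filter, if any, each interface of a guest references; the guest name is a placeholder, and the filter name shown in the comment is the one typically seen on oVirt hosts, so treat both as assumptions for your environment.

# Minimal sketch (libvirt Python bindings): list the network filter attached
# to each vNIC of a guest. The guest name 'cluster-member-1' is a placeholder.
import libvirt
import xml.etree.ElementTree as ET

conn = libvirt.open('qemu:///system')        # connect to the local hypervisor
dom = conn.lookupByName('cluster-member-1')  # hypothetical guest name
root = ET.fromstring(dom.XMLDesc(0))

for iface in root.findall('./devices/interface'):
    mac = iface.find('mac').get('address')
    filt = iface.find('filterref')
    name = filt.get('filter') if filt is not None else '(no filter)'
    # On oVirt hosts this typically shows 'vdsm-no-mac-spoofing' while the
    # anti-MAC-spoofing rule is still in place on that vNIC.
    print(f'{mac} -> {name}')

conn.close()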
On the oVirt side we removed the anti-MAC-spoofing rule from the cluster VM interfaces. At the time of writing, oVirt is in the process of adding a default setting that enables the anti-spoofing network filter. See this link for details: https://ovirt.org/develop/release-management/features/network/networkfiltering.html
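In oVirt the filter is assigned through the vNIC profile, so the change itself is made by editing the relevant profiles and selecting no network filter. For completeness, here is a rough sketch using the oVirt Python SDK (ovirtsdk4) that reports the network filter assigned to each vNIC profile, which helps identify the cluster-facing profiles before making the change; the engine URL and credentials are placeholders, not values from our setup.

# Rough sketch (ovirtsdk4): report the network filter assigned to each vNIC
# profile. Engine URL and credentials below are placeholders.
import ovirtsdk4 as sdk

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # hypothetical engine
    username='admin@internal',
    password='secret',
    insecure=True,  # lab use only; pass ca_file=... in production
)

try:
    system = connection.system_service()
    profiles_service = system.vnic_profiles_service()
    filters_service = system.network_filters_service()

    for profile in profiles_service.list():
        if profile.network_filter is not None:
            # The list call returns the filter as a link; fetch it for its name.
            nf = filters_service.network_filter_service(
                profile.network_filter.id).get()
            filter_name = nf.name
        else:
            filter_name = '(no network filter)'
        print(f'{profile.name}: {filter_name}')
finally:
    connection.close()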