Hi,
I have a lab running inside EVE-NG.
I have configured one physical interface as bond0.
bond0 carries three VLAN interfaces, and one (sometimes two) of these VLAN interfaces keeps going down.
The other side is a Cisco switch with a trunk port connected to interface eth5 on the gateway.
A-GW-1> show cluster state
Cluster Mode: High Availability (Active Up) with IGMP Membership
ID Unique Address Assigned Load State Name
1 (local) 172.22.1.2 0% DOWN A-GW-1
2 172.22.1.1 100% ACTIVE(!) A-GW-2
Active PNOTEs: LPRB, IAC
Last member state change event:
Event Code: CLUS-110300
State change: STANDBY -> DOWN
Reason for state change: Interface bond0.10 is down (Cluster Control Protocol packets are not received)
Event time: Thu May 2 17:42:56 2024
Last cluster failover event:
Transition to new ACTIVE: Member 1 -> Member 2
Reason: Interface bond0.10 is down (Cluster Control Protocol packets are not received)
Event time: Thu May 2 17:39:06 2024
Cluster failover count:
Failover counter: 3
Time of counter reset: Thu May 2 16:19:41 2024 (reboot)
A-GW-1> show interface bond0.10
state on
mac-addr 50:00:00:08:00:05
type vlan
link-state not available
mtu 1500
auto-negotiation off (bond0)
speed 1000M (bond0)
ipv6-autoconfig Not configured
monitor-mode Not configured
duplex full (bond0)
link-speed Not configured
comments VLAN_10
ipv4-address 10.10.10.10/24
ipv6-address Not Configured
ipv6-local-link-address Not Configured
Statistics:
TX bytes:14208090 packets:338285 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:0 packets:0 errors:0 dropped:0 overruns:0 frame:0
So the interface state is "on" and TX counters are increasing, but RX is zero and the cluster still reports the interface as down.
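To make the one-way traffic pattern in the statistics easier to spot across several VLAN interfaces, here is a minimal Python sketch that parses the TX/RX counter lines shown in the output. The function names `parse_counters` and `rx_looks_dead` are hypothetical helpers written for this post, not part of any Check Point tooling, and the line format is assumed to match the sample output from `show interface bond0.10`.

```python
import re

def parse_counters(line):
    """Split a 'TX bytes:1 packets:2 ...' line into (direction, {name: value})."""
    direction, _, rest = line.strip().partition(" ")
    counters = {name: int(value) for name, value in re.findall(r"(\w+):(\d+)", rest)}
    return direction, counters

def rx_looks_dead(stat_lines):
    """Return True when packets are being sent but none have been received."""
    stats = dict(parse_counters(line) for line in stat_lines)
    return stats["TX"]["packets"] > 0 and stats["RX"]["packets"] == 0

# Counter lines copied from the bond0.10 output in this post.
sample = [
    "TX bytes:14208090 packets:338285 errors:0 dropped:0 overruns:0 carrier:0",
    "RX bytes:0 packets:0 errors:0 dropped:0 overruns:0 frame:0",
]

print(rx_looks_dead(sample))
```

A result of `True` only says that nothing is arriving on that VLAN interface; it does not say whether the frames are lost on the switch trunk, in the EVE-NG virtual link, or in the e1000 driver.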
Here is the log from /var/log/messages:
ext3 jbd dm_multipath lp pcspkr sr_mod cdrom psmouse serio_raw button parport_pc parport e1000 i2c_piix4 dm_snapshot dm_bufio dm_zero dm_mirror dm_region_hash dm_log dm_mod xfs mptspi mptscsih mptbase virtio_scsi virtio_blk virtio_pci virtio_ring virtio nvme nvme_core ata_piix ahci libahci libata sg sym53c8xx scsi_transport_spi cciss sd_mod crc_t10dif crct10dif_common scsi_transport_fc scsi_tgt
May 2 17:38:45 2024 A-GW-2 kernel:CPU: 0 PID: 7370 Comm: snd_c Kdump: loaded Tainted: P OEL ------------ 3.10.0-1160.15.2cpx86_64 #1
May 2 17:38:45 2024 A-GW-2 kernel:Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
May 2 17:38:45 2024 A-GW-2 kernel:task: ffff8800376ad540 ti: ffff880087dd4000 task.ti: ffff880087dd4000
May 2 17:38:45 2024 A-GW-2 kernel:RIP: 0010:[<ffffffff902d9a11>] [<ffffffff902d9a11>] e1000_alloc_rx_buffers+0xd1/0x6d0 [e1000]
May 2 17:38:45 2024 A-GW-2 kernel:RSP: 0018:ffff8801bfc03d48 EFLAGS: 00000286
May 2 17:38:45 2024 A-GW-2 kernel:RAX: ffffc90001082818 RBX: 00ff8801bfc039fc RCX: 00000000000005f2
May 2 17:38:45 2024 A-GW-2 kernel:RDX: 00000000000000e8 RSI: 000000009d24a640 RDI: ffff8801acf1a8c0
May 2 17:38:45 2024 A-GW-2 kernel:RBP: ffff8801bfc03da0 R08: 0000000000000000 R09: ffff8801bfc03b40
May 2 17:38:45 2024 A-GW-2 kernel:R10: ffff8801bfc039fc R11: 0000000000000000 R12: ffff8801bfc03cb8
May 2 17:38:45 2024 A-GW-2 kernel:R13: ffffffff817c544a R14: ffff8801bfc03da0 R15: ffffc90025015e90
May 2 17:38:45 2024 A-GW-2 kernel:FS: 0000000000000000(0000) GS:ffff8801bfc00000(0000) knlGS:0000000000000000
May 2 17:38:45 2024 A-GW-2 kernel:CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 2 17:38:45 2024 A-GW-2 kernel:CR2: 00000000f7721000 CR3: 000000018d510000 CR4: 00000000000006f0
May 2 17:38:45 2024 A-GW-2 kernel:Call Trace:
May 2 17:38:45 2024 A-GW-2 kernel: <IRQ>
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff902d96e1>] e1000_clean_rx_irq+0x2d1/0x530 [e1000]
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff902da233>] e1000_clean+0x223/0x8c0 [e1000]
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff810a4b8c>] ? mod_timer+0x10c/0x240
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff816a23fb>] net_rx_action+0x26b/0x3a0
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff8109aef8>] __do_softirq+0x128/0x290
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff817c797c>] call_softirq+0x1c/0x30
May 2 17:38:45 2024 A-GW-2 kernel: <EOI>
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff8102f9a5>] do_softirq+0x55/0x90
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff8109a2f0>] __local_bh_enable_ip+0x60/0x70
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff8109a317>] local_bh_enable+0x17/0x20
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff90b01256>] cphwd_api_message+0x6d6/0xae0 [simmod_0]
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff916e4160>] ? cphwd_q_pending_queue_try_flush+0x460/0x460 [fw_0]
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff916e4186>] ? cphwd_q_async_dequeue_cb.lto_priv.2422+0x26/0x70 [fw_0]
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff927eca4e>] ? kernel_thread_run+0x39e/0xfb0 [fw_0]
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff810bff50>] ? wake_up_atomic_t+0x30/0x30
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff927ac4d0>] ? cpaq_kut_register_client+0x40/0x40 [fw_0]
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff927b219e>] ? kiss_kthread_run+0x1e/0x50 [fw_0]
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff927ac4eb>] ? plat_run_thread+0x1b/0x30 [fw_0]
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff810befc2>] ? kthread+0xe2/0xf0
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff810beee0>] ? insert_kthread_work+0x40/0x40
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff817c429d>] ? ret_from_fork_nospec_begin+0x7/0x21
May 2 17:38:45 2024 A-GW-2 kernel: [<ffffffff810beee0>] ? insert_kthread_work+0x40/0x40
May 2 17:38:45 2024 A-GW-2 kernel:Code: 00 00 45 3b 65 18 74 23 45 85 e4 45 89 65 18 41 8d 54 24 ff 0f 84 c5 04 00 00 0f ae f8 41 0f b7 45 36 49 03 87 d0 03 00 00 89 10 <48> 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 4c 89 ff e8 18 ca
May 2 17:38:45 2024 A-GW-2 kernel:sending NMI to other CPUs:
May 2 17:38:45 2024 A-GW-2 kernel:NMI backtrace for cpu 1 skipped: idling at pc 0xffffffff817b83fb
May 2 17:38:49 2024 A-GW-2 spike_detective: spike info: type: cpu, cpu core: 0, top consumer: system interrupts, start time: 02/05/24 17:38:25, spike duration (sec): 23, initial cpu usage: 100, average cpu usage: 100, perf taken: 0
May 2 17:39:00 2024 A-GW-2 kernel:[fw4_1];CLUS-220201-2: Starting CUL mode because CPU usage (81%) on the remote member 1 increased above the configured threshold (80%).
May 2 17:39:01 2024 A-GW-2 kernel:[fw4_1];CLUS-210300-2: Remote member 1 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
May 2 17:39:06 2024 A-GW-2 kernel:[fw4_1];CLUS-114405-2: State change: ACTIVE! -> STANDBY | Reason: Member state has been changed after returning from ACTIVE/ACTIVE scenario (remote cluster member 1 has higher priority)
May 2 17:39:06 2024 A-GW-2 kernel:[fw4_1];CLUS-210305-2: Remote member 1 (state DOWN -> ACTIVE(!)) | Reason: Interface is down (Cluster Control Protocol packets are not received)
May 2 17:39:06 2024 A-GW-2 kernel:[fw4_1];CLUS-100201-2: Failover member 2 -> member 1 | Reason: Member state has been changed after returning from ACTIVE/ACTIVE scenario (remote cluster member 1 has higher priority)
May 2 17:39:07 2024 A-GW-2 kernel:[fw4_1];CLUS-210300-2: Remote member 1 (state ACTIVE(!) -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
May 2 17:39:07 2024 A-GW-2 kernel:[fw4_1];CLUS-114704-2: State change: STANDBY -> ACTIVE | Reason: No other ACTIVE members have been found in the cluster
May 2 17:39:07 2024 A-GW-2 kernel:[fw4_1];CLUS-100102-2: Failover member 1 -> member 2 | Reason: Available on member 1
May 2 17:39:07 2024 A-GW-2 kernel:[fw4_1];CLUS-214802-2: Remote member 1 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
May 2 17:39:12 2024 A-GW-2 spike_detective: spike info: type: cpu, cpu core: 0, top consumer: system interrupts, start time: 02/05/24 17:38:54, spike duration (sec): 17, initial cpu usage: 100, average cpu usage: 100, perf taken: 0
May 2 17:40:00 2024 A-GW-2 xpand[6195]: admin localhost t +volatile:clish:admin:28133 t
May 2 17:40:00 2024 A-GW-2 clish[28133]: User admin logged in with ReadWrite permission
May 2 17:42:56 2024 A-GW-2 kernel:[fw4_1];CLUS-120202-2: Stopping CUL mode after 199 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
May 2 17:42:56 2024 A-GW-2 kernel:[fw4_1];CLUS-110305-2: State change: ACTIVE -> ACTIVE(!) | Reason: Interface bond0.10 is down (Cluster Control Protocol packets are not received)
May 2 17:42:56 2024 A-GW-2 kernel:[fw4_1];CLUS-210300-2: Remote member 1 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
May 2 17:43:02 2024 A-GW-2 kernel:[fw4_1];CLUS-114904-2: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
May 2 17:43:02 2024 A-GW-2 kernel:[fw4_1];CLUS-120200-2: Starting CUL mode because CPU-00 usage (81%) on the local member increased above the configured threshold (80%).
May 2 17:43:22 2024 A-GW-2 kernel:[fw4_1];CLUS-120202-2: Stopping CUL mode after 17 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
May 2 17:43:22 2024 A-GW-2 kernel:[fw4_1];CLUS-110305-2: State change: ACTIVE -> ACTIVE(!) | Reason: Interface bond0.10 is down (Cluster Control Protocol packets are not received)
May 2 17:43:28 2024 A-GW-2 kernel:[fw4_1];CLUS-114904-2: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
May 2 17:43:33 2024 A-GW-2 kernel:[fw4_1];CLUS-110305-2: State change: ACTIVE -> ACTIVE(!) | Reason: Interface bond0.10 is down (Cluster Control Protocol packets are not received)
May 2 17:43:35 2024 A-GW-2 kernel:[fw4_1];CLUS-120200-2: Starting CUL mode because CPU-00 usage (88%) on the local member increased above the configured threshold (80%).
May 2 17:43:35 2024 A-GW-2 kernel:[fw4_1];CLUS-114904-2: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
May 2 17:43:46 2024 A-GW-2 kernel:[fw4_1];CLUS-120202-2: Stopping CUL mode after 10 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
May 2 17:43:46 2024 A-GW-2 kernel:[fw4_1];CLUS-110305-2: State change: ACTIVE -> ACTIVE(!) | Reason: Interface bond0.10 is down (Cluster Control Protocol packets are not received)
May 2 17:44:03 2024 A-GW-2 kernel:[fw4_1];CLUS-220201-2: Starting CUL mode because CPU usage (87%) on the remote member 1 increased above the configured threshold (80%).
May 2 17:44:03 2024 A-GW-2 kernel:[fw4_1];CLUS-114904-2: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
May 2 17:44:44 2024 A-GW-2 kernel:[fw4_1];CLUS-120202-2: Stopping CUL mode after 37 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
May 2 17:44:44 2024 A-GW-2 kernel:[fw4_1];CLUS-110305-2: State change: ACTIVE -> ACTIVE(!) | Reason: Interface bond0.10 is down (Cluster Control Protocol packets are not received)
May 2 17:45:03 2024 A-GW-2 kernel:[fw4_1];CLUS-114904-2: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
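To summarize how often the local member flaps and why, the CLUS lines above can be parsed with a short Python sketch. This is only an illustration based on the line format seen in this excerpt; the regular expressions and the helper name `local_state_changes` are my own assumptions and may need adjusting for other log formats or versions.

```python
import re
from collections import Counter

# Matches lines such as:
# "May 2 17:42:56 2024 A-GW-2 kernel:[fw4_1];CLUS-110305-2: State change: ..."
EVENT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d \d{4}) (?P<host>\S+) "
    r"kernel:\[fw4_\d+\];(?P<code>CLUS-\d+)-\d+: (?P<msg>.*)$"
)
# Only local transitions start with "State change:"; remote ones start with "Remote member".
CHANGE_RE = re.compile(
    r"State change: (?P<old>\S+) -> (?P<new>\S+) \| Reason: (?P<reason>.*)$"
)

def local_state_changes(lines):
    """Yield (timestamp, old_state, new_state, reason) for local state changes."""
    for line in lines:
        event = EVENT_RE.match(line.strip())
        if not event:
            continue
        change = CHANGE_RE.match(event["msg"])
        if change:
            yield event["ts"], change["old"], change["new"], change["reason"]

# Three lines copied from the log in this post.
sample = [
    "May 2 17:42:56 2024 A-GW-2 kernel:[fw4_1];CLUS-110305-2: State change: ACTIVE -> ACTIVE(!) | Reason: Interface bond0.10 is down (Cluster Control Protocol packets are not received)",
    "May 2 17:42:56 2024 A-GW-2 kernel:[fw4_1];CLUS-210300-2: Remote member 1 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)",
    "May 2 17:43:02 2024 A-GW-2 kernel:[fw4_1];CLUS-114904-2: State change: ACTIVE(!) -> ACTIVE | Reason: Reason for ACTIVE! alert has been resolved",
]

changes = list(local_state_changes(sample))
reasons = Counter(reason for _, _, _, reason in changes)
for reason, count in reasons.most_common():
    print(count, reason)
```

Running it over the full file (for example with `open("/var/log/messages")` as the `lines` argument) would show whether bond0.10 is the only interface named in the state-change reasons.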