dphonovation
Collaborator

4-node cluster, all members are randomly segfaulting

I had 2 clusters, both managed by the same management server, with 2 gateways each.

I've since removed the 2nd cluster and re-added those gateways to my first cluster, still under the same management server.

Re-established SIC after renaming them, pushed policy, etc. All seemed to go well.

 

I can't be sure this was the cause, but ever since then the cluster has been flapping, and upon inspection I see this in /var/log/messages:

 

 


 

Jan 6 14:47:01 2023 cp-fw4-site2 xpand[11083]: Configuration changed from localhost by user admin by the service dbset

Jan 6 14:47:01 2023 cp-fw4-site2 last message repeated 6 times

Jan 6 14:47:12 2023 cp-fw4-site2 fwk: CLUS-120001-1: Cluster policy installation started (old/new Policy ID: 913558753/716459846)

Jan 6 14:47:13 2023 cp-fw4-site2 xpand[11083]: admin localhost t +volatile:clish:admin:6141 t

Jan 6 14:47:13 2023 cp-fw4-site2 clish[6141]: User admin running clish -c with ReadWrite permission

Jan 6 14:47:13 2023 cp-fw4-site2 clish[6141]: cmd by admin: Start executing : ver (cmd md5: 0812f14f43315611dd0ef462515c9d00)

Jan 6 14:47:13 2023 cp-fw4-site2 clish[6141]: cmd by admin: Processing : ver (cmd md5: 0812f14f43315611dd0ef462515c9d00)

Jan 6 14:47:13 2023 cp-fw4-site2 xpand[11083]: admin localhost t -volatile:clish:admin:6141

Jan 6 14:47:13 2023 cp-fw4-site2 clish[6141]: User admin finished running clish -c from CLI shell

Jan 6 14:47:14 2023 cp-fw4-site2 spike_detective: spike info: type: thread, thread id: 3263, thread name: fw_full, start time: 06/01/23 14:47:07, spike duration (sec): 6, initial cpu usage: 79, average cpu usage: 79, perf taken: 0

Jan 6 14:47:15 2023 cp-fw4-site2 fwk: CLUS-120005-4: Cluster policy installation finished - no change was done (Type-2)

Jan 6 14:47:15 2023 cp-fw4-site2 fwk: CLUS-120125-4: CCP Encryption turned ON

Jan 6 14:47:19 2023 cp-fw4-site2 xpand[11083]: Configuration changed from localhost by user admin by the service dbset

Jan 6 14:47:22 2023 cp-fw4-site2 kernel: fw_full (3202) used greatest stack depth: 12040 bytes left

Jan 6 14:47:28 2023 cp-fw4-site2 xpand[11083]: Configuration changed from localhost by user admin by the service dbset

Jan 6 14:47:28 2023 cp-fw4-site2 last message repeated 6 times

Jan 6 14:47:50 2023 cp-fw4-site2 kernel: fwk0_0[11551]: segfault at 7fb0e86d5cb8 ip 00007fb0a90f972e sp 00007fb058f3c1d0 error 6 in libfw_kern_64_us_0.so[7fb0a8188000+1cbf000]

Jan 6 14:47:54 2023 cp-fw4-site2 spike_detective: spike info: type: cpu, cpu core: 1, top consumer: fw_full, start time: 06/01/23 14:46:50, spike duration (sec): 63, initial cpu usage: 99, average cpu usage: 96, perf taken: 0

Jan 6 14:47:54 2023 cp-fw4-site2 spike_detective: spike info: type: thread, thread id: 6094, thread name: fw_full, start time: 06/01/23 14:47:42, spike duration (sec): 11, initial cpu usage: 86, average cpu usage: 85, perf taken: 0

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f118000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4d1000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4d2000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4d3000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4d4000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4d5000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4d6000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4d7000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4d8000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4d9000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4da000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4db000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4dc000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4dd000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4de000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4df000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e0000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e1000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e2000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e3000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e4000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e5000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e6000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e7000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e8000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e9000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4ea000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4eb000, vm_start 0x7fb05f118000)
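
For anyone wading through similar noise: this is roughly how I've been pulling just the relevant entries out of the rotated logs from Expert mode (standard Gaia log paths, nothing exotic):

# Expert mode: segfaults, cluster policy events, and the ZeCo fault messages
grep -hE 'segfault|CLUS-|fwzeco_vm_ops_shinfo_fault' /var/log/messages*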

 

 

So far I've tried disabling/re-enabling Content Awareness and disabling CoreXL.

The gateways hitting this most often are the 2 that were re-added, but the first 2 members are also occasionally experiencing the same segfault.
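
In the meantime I've been watching which member flaps using the standard ClusterXL commands from Expert mode (show_failover only exists on recent versions, if I recall):

# State of all cluster members as seen from this member
cphaprob state

# Critical devices / pnotes - anything not "OK" here is a lead
cphaprob -l list

# Failover history with reasons (recent versions)
cphaprob show_failover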

I also see the "Configuration changed" messages - that isn't me doing anything. From a different SK I read this could be the update manager?

Can anyone shed any light on what could be causing this? I've already opened a case with TAC, but it's dragging a bit.

6 Replies
dphonovation
Collaborator

I'm not 100% sure, but it appears this might have been caused by incorrect hosts entries (as seen in the Gaia web portal -> general -> hosts and DNS). Although I had tried removing all gateways from a cloning group and adding them back in, it seems some had discrepancies: one FW had a wrong entry for a different member, and the 4th FW had duplicates.
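
For anyone hitting the same thing, this is roughly how I compared the entries on each member and fixed the bad one (the member name below is one of mine from the logs; the IP is a placeholder - use your member's real address):

# Expert mode on every member - these lines should be identical across the cluster
clish -c "show configuration" | grep "add host"
cat /etc/hosts

# Correct a wrong entry and persist it
clish -c "set host name cp-fw4-site2 ipv4-address 192.0.2.4"
clish -c "save config"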

the_rock
Legend

Was the hostname different between the web UI and the output of show hostname in clish?
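
E.g. from Expert mode, these two should both agree with what the WebUI shows:

clish -c "show hostname"
hostname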

dphonovation
Collaborator

No, the name definitions were correct, but the IPs were not.

the_rock
Legend

K, I gotcha... yeah, those should match, for sure. Still, I find it a bit odd that that would have caused the seg faulting.

dphonovation
Collaborator

Agreed. I called TAC in tears. They took the coredumps and cpinfos away. I started clicking through settings and found that. Since then, no coredumps.

the_rock
Legend

I hear you, those issues are never fun. Happy it's fixed.
