dphonovation
Collaborator

4-node cluster, all members are randomly segfaulting

I had 2 clusters, both managed by the same management server, with 2 gateways each.

I've since removed the 2nd cluster and re-added those gateways to my first cluster, still under the same management server.

Re-established SIC after renaming them, pushed policy, etc. All seemed to go well.

 

I can't be sure this was the cause, but ever since then the cluster has been flapping, and upon inspection I see this in /var/log/messages:

 

 


 

Jan 6 14:47:01 2023 cp-fw4-site2 xpand[11083]: Configuration changed from localhost by user admin by the service dbset

Jan 6 14:47:01 2023 cp-fw4-site2 last message repeated 6 times

Jan 6 14:47:12 2023 cp-fw4-site2 fwk: CLUS-120001-1: Cluster policy installation started (old/new Policy ID: 913558753/716459846)

Jan 6 14:47:13 2023 cp-fw4-site2 xpand[11083]: admin localhost t +volatile:clish:admin:6141 t

Jan 6 14:47:13 2023 cp-fw4-site2 clish[6141]: User admin running clish -c with ReadWrite permission

Jan 6 14:47:13 2023 cp-fw4-site2 clish[6141]: cmd by admin: Start executing : ver (cmd md5: 0812f14f43315611dd0ef462515c9d00)

Jan 6 14:47:13 2023 cp-fw4-site2 clish[6141]: cmd by admin: Processing : ver (cmd md5: 0812f14f43315611dd0ef462515c9d00)

Jan 6 14:47:13 2023 cp-fw4-site2 xpand[11083]: admin localhost t -volatile:clish:admin:6141

Jan 6 14:47:13 2023 cp-fw4-site2 clish[6141]: User admin finished running clish -c from CLI shell

Jan 6 14:47:14 2023 cp-fw4-site2 spike_detective: spike info: type: thread, thread id: 3263, thread name: fw_full, start time: 06/01/23 14:47:07, spike duration (sec): 6, initial cpu usage: 79, average cpu usage: 79, perf taken: 0

Jan 6 14:47:15 2023 cp-fw4-site2 fwk: CLUS-120005-4: Cluster policy installation finished - no change was done (Type-2)

Jan 6 14:47:15 2023 cp-fw4-site2 fwk: CLUS-120125-4: CCP Encryption turned ON

Jan 6 14:47:19 2023 cp-fw4-site2 xpand[11083]: Configuration changed from localhost by user admin by the service dbset

Jan 6 14:47:22 2023 cp-fw4-site2 kernel: fw_full (3202) used greatest stack depth: 12040 bytes left

Jan 6 14:47:28 2023 cp-fw4-site2 xpand[11083]: Configuration changed from localhost by user admin by the service dbset

Jan 6 14:47:28 2023 cp-fw4-site2 last message repeated 6 times

Jan 6 14:47:50 2023 cp-fw4-site2 kernel: fwk0_0[11551]: segfault at 7fb0e86d5cb8 ip 00007fb0a90f972e sp 00007fb058f3c1d0 error 6 in libfw_kern_64_us_0.so[7fb0a8188000+1cbf000]

Jan 6 14:47:54 2023 cp-fw4-site2 spike_detective: spike info: type: cpu, cpu core: 1, top consumer: fw_full, start time: 06/01/23 14:46:50, spike duration (sec): 63, initial cpu usage: 99, average cpu usage: 96, perf taken: 0

Jan 6 14:47:54 2023 cp-fw4-site2 spike_detective: spike info: type: thread, thread id: 6094, thread name: fw_full, start time: 06/01/23 14:47:42, spike duration (sec): 11, initial cpu usage: 86, average cpu usage: 85, perf taken: 0

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f118000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4d1000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4d2000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4d3000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4d4000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4d5000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4d6000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4d7000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4d8000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4d9000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4da000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4db000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4dc000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4dd000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4de000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4df000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e0000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e1000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e2000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e3000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e4000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e5000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e6000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e7000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e8000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4e9000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4ea000, vm_start 0x7fb05f118000)

Jan 6 14:47:57 2023 cp-fw4-site2 kernel: [fw4_0];fwzeco_vm_ops_shinfo_fault: fwk0_0, bad kernel_address (user_address 0x7fb05f4eb000, vm_start 0x7fb05f118000)
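
For anyone wading through similar noise: this is roughly how I've been pulling just the relevant entries out of the rotated logs from Expert mode (standard Gaia log paths, nothing exotic):

# Expert mode: segfaults, cluster policy events, and the ZeCo fault messages
grep -hE 'segfault|CLUS-|fwzeco_vm_ops_shinfo_fault' /var/log/messages*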

 

 

So far I've tried disabling/re-enabling Content Awareness and disabling CoreXL.

The gateways hitting this most often are the 2 that were re-added, but the first 2 members are also occasionally experiencing the same segfault.
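
In the meantime I've been watching which member flaps using the standard ClusterXL commands from Expert mode (show_failover only exists on recent versions, if I recall):

# State of all cluster members as seen from this member
cphaprob state

# Critical devices / pnotes - anything not "OK" here is a lead
cphaprob -l list

# Failover history with reasons (recent versions)
cphaprob show_failover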

I also see the "Configuration changed" messages - that isn't me doing anything. From a different SK I read this could be the update manager?

Can anyone shed any light on what could be causing this? I've already opened a case with TAC, but it's dragging a bit.

6 Replies
dphonovation
Collaborator

I'm not 100% sure, but it appears this might have been caused by incorrect hosts entries (as seen in the Gaia web portal -> general -> hosts and DNS). Although I had tried removing all gateways from a cloning group and adding them back in, it seems some had discrepancies: one FW had a wrong entry for a different member, and the 4th FW had duplicates.
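
For anyone hitting the same thing, this is roughly how I compared the entries on each member and fixed the bad one (the member name below is one of mine from the logs; the IP is a placeholder - use your member's real address):

# Expert mode on every member - these lines should be identical across the cluster
clish -c "show configuration" | grep "add host"
cat /etc/hosts

# Correct a wrong entry and persist it
clish -c "set host name cp-fw4-site2 ipv4-address 192.0.2.4"
clish -c "save config"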

the_rock
Legend

Was the hostname different between the web UI and the output of show hostname in clish?
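
E.g. from Expert mode, these two should both agree with what the WebUI shows:

clish -c "show hostname"
hostname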

dphonovation
Collaborator

No, the name definitions were correct, but the IPs were not.

the_rock
Legend

K, I gotcha... yeah, those should match, for sure. Still, I find it a bit odd that that would have caused the seg faulting.

dphonovation
Collaborator

Agreed. I called TAC in tears. They took the coredumps and cpinfos away. I started clicking through settings and found that. Since then, no coredumps.

the_rock
Legend

I hear you, those issues are never fun. Happy it's fixed.
