I have a 16-core open server HA Active/Passive cluster running R80.10, fully hotfixed. It is only licensed for 4 CPUs.
1 SND and 3 workers.
The 1 SND sits around 30% utilisation.
The 3 workers are balanced at around 15% utilisation.
Affinity is shown here:
fw ctl affinity -l -r
CPU 0: eth3 eth0 eth1 eth8 eth9 eth4 eth7
CPU 1: fw_2
rtmd fwd mpdaemon lpd in.ahclientd in.aclientd in.aftpd cprid cpd
CPU 2: fw_1
rtmd fwd mpdaemon lpd in.ahclientd in.aclientd in.aftpd cprid cpd
CPU 3: fw_0
rtmd fwd mpdaemon lpd in.ahclientd in.aclientd in.aftpd cprid cpd
CPU 4:
CPU 5:
CPU 6:
CPU 7:
CPU 8:
CPU 9:
CPU 10:
CPU 11:
CPU 12:
CPU 13:
CPU 14:
CPU 15:
All:
The current license permits the use of CPUs 0, 1, 2, 3 only.
We have a very sporadic issue on policy deployment: I believe there is a load problem where at least one of the CPUs gets maxed out and can't cope, which has a knock-on effect on various processes not functioning properly.
I can see that the ksoftirqd/0 process has clearly been busy. Further background below, but having watched top during a deployment (when load is much lower and there is no problem) I see several of the daemons listed above consume a significant amount of CPU for a second or two.
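Since all interfaces are affined to CPU 0, one way to confirm it is network softirq work pinning that core (rather than the workers) is to watch the per-CPU NET_RX softirq counters during a deployment. This is a rough sketch using only the standard Linux /proc interface, nothing Check Point specific; the 5-second interval is arbitrary:

```shell
# Sample the NET_RX softirq counters twice, 5 seconds apart.
# Each column is one CPU; the column whose count climbs fastest
# between the two samples is the core doing the softirq work
# (expected to be CPU 0 here, the SND core).
before=$(grep "NET_RX:" /proc/softirqs)
sleep 5
after=$(grep "NET_RX:" /proc/softirqs)
echo "before: $before"
echo "after:  $after"
```

If CPU 0's counter races ahead of the others while ksoftirqd/0 is eating CPU, that points at the single SND being the bottleneck rather than the workers.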
Our initial thinking is to use fw ctl affinity -s -n ........ to move these processes onto the 12 CPUs that are not licensed, taking the strain off the SND and three workers at the point of policy deployment.
Can anyone confirm that these processes will work on unlicensed Cores without issue?
(I'm also planning to drop to 2 workers and split the SND across 2 processors for further balancing, in case anyone suggests it.)
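For clarity, this is the sort of thing we had in mind, sketched out. The daemon names come from the affinity output above, but the target CPU list (4-7, unlicensed cores) is purely illustrative, and the exact -s -n syntax should be verified against the R80.10 Performance Tuning guide before running anything on a production member:

```shell
# Illustrative only - move selected daemons onto unlicensed cores
# 4-7 so policy compilation/install work stops contending with the
# SND (CPU 0) and the three fw workers (CPUs 1-3).
fw ctl affinity -s -n fwd 4 5 6 7
fw ctl affinity -s -n rtmd 4 5 6 7

# Verify the resulting layout:
fw ctl affinity -l -r
```

The open question is whether those daemons are even permitted to schedule on cores outside the 4-CPU license, which is what I'm hoping someone can confirm.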
Background info for those interested:
The firewall used to run around 120,000 concurrent connections until recent client device changes; we are now averaging around 170,000 concurrent connections.
Occasionally on policy deployment we now see a problem whereby the Active member has significant stability issues for a period of anywhere from 10 minutes to 2.5 hours. During this time it logs nothing: no cpview data, nothing in messages, no access to the CLI, and DHCP relay fails. It continues to pass traffic, albeit with more latency, causing general slowness. When it recovers and starts responding again, it logs that it has restarted multiple fwd child processes:
[22 May 14:26:41] fwd: pid 4413 is not responding, killing process
[22 May 14:26:41] fwd: pid 4436 is not responding, killing process
[22 May 14:26:41] fwd: pid 4437 is not responding, killing process
[22 May 14:26:41] fwd: pid 4483 is not responding, killing process
[22 May 14:26:41] fwd: pid 4503 is not responding, killing process
[22 May 14:26:41] fwd: pid 17942 is not responding, killing process
[22 May 14:26:41] fwd: pid 20064 is not responding, killing process
[22 May 14:26:41] fwd: pid 20065 is not responding, killing process
[22 May 14:26:41] fwd: pid 20076 is not responding, killing process
top then shows all is normal, but the load average is massive and quickly drops.
A TAC case has been raised several times, but with no logs output during the issue it is difficult to come up with any hard solutions. There is no appetite to deliberately pick a busy time and attempt a deployment with lots of debugging enabled (assuming it continued to respond), as the instability is very service-affecting and will go on for as long as it takes, which as I say can be hours.
Thanks all, hope you are having a good day!