RickHoppe
Advisor

CoreXL affinity on VSX R80.10 does not survive reboot?

I already submitted a case to TAC but I'm wondering if the community experiences the following behaviour too.

It appears we now have two customers with the same issue on VSX R80.10.

Customer 1: VSX cluster which is recently upgraded from R77.30 to R80.10 with JHF Take 154.

Customer 2: New VSX clusters on R80.10 with JHF Take 167, not in production yet.

CoreXL is disabled on VS0. SecureXL is enabled on all Virtual Systems.

SIM affinity and CoreXL affinity were set with the usual commands. But it appears that after the upgrade of Customer 1 and the planned reboots of the Customer 2 nodes (not in production yet), the affinity config is not applied. When checking this with "fw ctl affinity -l -a" we still see the configured affinity.

But a "fw ctl affinity -l -x -vsid 1 -flags tn" shows that the default affinity is applied. The * in the V column confirms that the actual affinity is different than the configured affinity.

I can change the affinity again to fix this, but it is only a temporary fix as a reboot resets everything to default again.


Example of the commands that were given at Customer 2:

sim affinity -s

fw ctl affinity -s -d -vsid 2-3 -cpu 2-3


If I'm not mistaken the -d option is meant to set this permanently, right?
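For reference, a minimal "set and verify" sequence (untested sketch, using the same example CPU/VSID values as above; adjust for your own cluster) would be:

# set the persistent ("default") affinity for VS2 and VS3
fw ctl affinity -s -d -vsid 2-3 -cpu 2-3

# configured affinity, as stored
fw ctl affinity -l -a

# actual per-process affinity for one of the VSes; a * in the V column
# means the running affinity differs from the configured one
fw ctl affinity -l -a -x -vsid 2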

Has anyone else seen this in their environment?

My blog: https://checkpoint.engineer
21 Replies
_Val_
Admin

Sounds beyond weird. Can you actually post the commands and output here?

RickHoppe
Advisor

Sure.

This is what I've done for Customer 2. It's a VSX cluster on 5600 appliances, with only 4 CPUs and 4 bonds (SYNC is also part of a bonding group). As it's not in production yet, I've changed the affinity a bit to make the issue more visible.

#sim affinity -s
Usage : For each interface enter one of the following:
Return - To keep the default values (appearing in [ ])
all - To allow all processors for this interface
List of processors - A list of processor numbers between 0 and 3

Mgmt [0] : 0
Sync [0] : 1
eth1 [0] : 0
eth2 [0] : 0
eth3 [0] : 0
eth4 [0] : 0
eth5 [0] : 1
eth6 [0] : 1
eth7 [0] : 1

VS0 = Mgmt (of course)

VS1 = Virtual Switch

VS2 - VS3 = Virtual Systems

# fw ctl affinity -s -d -vsid 0-1 -cpu 2
VDevice 0-1 : CPU 2 - set successfully
Multi-queue affinity was not changed. For More info, see sk113834.
# fw ctl affinity -s -d -vsid 2-3 -cpu 3
VDevice 2-3 : CPU 3 - set successfully
Multi-queue affinity was not changed. For More info, see sk113834.

The affinity configuration is then shown as:

# fw ctl affinity -l -a
eth1: CPU 0
eth5: CPU 1
eth2: CPU 0
eth6: CPU 1
eth3: CPU 0
eth7: CPU 1
eth4: CPU 0
Sync: CPU 1
Mgmt: CPU 0
VS_0: CPU 2
VS_0 fwk: CPU 2
VS_1: CPU 2
VS_1 fwk: CPU 2
VS_2: CPU 3
VS_2 fwk: CPU 3
VS_3: CPU 3
VS_3 fwk: CPU 3


Actual affinity (before reboot) is shown as:

# fw ctl affinity -l -a -x -vsid 1
---------------------------------------------------------------------
|PID |VSID | CPU |SRC|V|KT |EXC| NAME
---------------------------------------------------------------------
| 17381 | 1 | 2 | V | | | | snmpd_1
| 19638 | 1 | 2 | V | | | | fwk_wd
| 19693 | 1 | 2 | V | | | | fwk
| 19942 | 1 | 2 | V | | | | cpd
| 19944 | 1 | 2 | V | | | | fwd
| 20013 | 1 | 2 | V | | | | cpviewd
---------------------------------------------------------------------
PID - represents the pid of the process
VSID - represents the virtual device id
CPU - represents the CPUs assigned to the specific process
SRC - represents the source configuration file of the process - (V)SID / (I)nstance / (P)rocess
V - represents validity,star means that the actual affinity is different than the configured affinity
KT - represents whether the process is a kernel thread
EXC - represents whether the process belongs to the process exception list (vsaffinity_exception.conf)

# fw ctl affinity -l -a -x -vsid 2
---------------------------------------------------------------------
|PID |VSID | CPU |SRC|V|KT |EXC| NAME
---------------------------------------------------------------------
| 4881 | 2 | 3 | V | | | | routed
| 6486 | 2 | 3 | V | | | | fwd
| 17402 | 2 | 3 | V | | | | snmpd_2
| 19691 | 2 | 3 | V | | | | fwk_wd
| 19696 | 2 | 3 | V | | | | fwk
| 19947 | 2 | 3 | V | | | | cpd
| 20030 | 2 | 3 | V | | | | cpviewd
| 20034 | 2 | 3 | V | | | | mpdaemon
| 20036 | 2 | 3 | V | | | | ci_http_server
| 28911 | 2 | 3 | V | | | | cphamcset
---------------------------------------------------------------------
PID - represents the pid of the process
VSID - represents the virtual device id
CPU - represents the CPUs assigned to the specific process
SRC - represents the source configuration file of the process - (V)SID / (I)nstance / (P)rocess
V - represents validity,star means that the actual affinity is different than the configured affinity
KT - represents whether the process is a kernel thread
EXC - represents whether the process belongs to the process exception list (vsaffinity_exception.conf)

# fw ctl affinity -l -a -x -vsid 3
---------------------------------------------------------------------
|PID |VSID | CPU |SRC|V|KT |EXC| NAME
---------------------------------------------------------------------
| 6254 | 3 | 3 | V | | | | fwd
| 17425 | 3 | 3 | V | | | | snmpd_3
| 19605 | 3 | 3 | V | | | | fwk_wd
| 19692 | 3 | 3 | V | | | | fwk
| 19893 | 3 | 3 | V | | | | cpd
| 19902 | 3 | 3 | V | | | | cpviewd
| 19904 | 3 | 3 | V | | | | mpdaemon
| 19938 | 3 | 3 | V | | | | ci_http_server
| 24492 | 3 | 3 | V | | | | cphamcset
| 29680 | 3 | 3 | V | | | | routed
---------------------------------------------------------------------
PID - represents the pid of the process
VSID - represents the virtual device id
CPU - represents the CPUs assigned to the specific process
SRC - represents the source configuration file of the process - (V)SID / (I)nstance / (P)rocess
V - represents validity,star means that the actual affinity is different than the configured affinity
KT - represents whether the process is a kernel thread
EXC - represents whether the process belongs to the process exception list (vsaffinity_exception.conf)

Now, when I perform a reboot the affinity configuration is still the same (as expected):

# fw ctl affinity -l -a
eth1: CPU 0
eth5: CPU 1
eth2: CPU 0
eth6: CPU 1
eth3: CPU 0
eth7: CPU 1
eth4: CPU 0
Sync: CPU 1
Mgmt: CPU 0
VS_0: CPU 2
VS_0 fwk: CPU 2
VS_1: CPU 2
VS_1 fwk: CPU 2
VS_2: CPU 3
VS_2 fwk: CPU 3
VS_3: CPU 3
VS_3 fwk: CPU 3

But now, after the reboot, the actual affinity differs from the configured affinity, as confirmed by the * in the V column (a quick way to check this on other nodes is sketched after the output below):

# fw ctl affinity -l -a -x -vsid 1
---------------------------------------------------------------------
|PID |VSID | CPU |SRC|V|KT |EXC| NAME
---------------------------------------------------------------------
| 14112 | 1 | 2 3 | V |*| | | snmpd_1
| 16657 | 1 | 2 3 | V |*| | | fwk_wd
| 16660 | 1 | 2 3 | V |*| | | fwk
| 18413 | 1 | 2 3 | V |*| | | cpd
| 18415 | 1 | 2 3 | V |*| | | fwd
| 18608 | 1 | 2 3 | V |*| | | cpviewd
---------------------------------------------------------------------
PID - represents the pid of the process
VSID - represents the virtual device id
CPU - represents the CPUs assigned to the specific process
SRC - represents the source configuration file of the process - (V)SID / (I)nstance / (P)rocess
V - represents validity,star means that the actual affinity is different than the configured affinity
KT - represents whether the process is a kernel thread
EXC - represents whether the process belongs to the process exception list (vsaffinity_exception.conf)

# fw ctl affinity -l -a -x -vsid 2
---------------------------------------------------------------------
|PID |VSID | CPU |SRC|V|KT |EXC| NAME
---------------------------------------------------------------------
| 2380 | 2 | 3 | V | | | | cphamcset
| 2400 | 2 | 2 | V |*| | | routed
| 14149 | 2 | 2 3 | V |*| | | snmpd_2
| 16598 | 2 | 2 3 | V |*| | | fwk_wd
| 16650 | 2 | 2 3 | V |*| | | fwk
| 18860 | 2 | 2 3 | V |*| | | cpd
| 19114 | 2 | 2 3 | V |*| | | fwd
| 19691 | 2 | 2 3 | V |*| | | cpviewd
| 19707 | 2 | 2 3 | V |*| | | mpdaemon
| 19709 | 2 | 2 3 | V |*| | | ci_http_server
---------------------------------------------------------------------
PID - represents the pid of the process
VSID - represents the virtual device id
CPU - represents the CPUs assigned to the specific process
SRC - represents the source configuration file of the process - (V)SID / (I)nstance / (P)rocess
V - represents validity,star means that the actual affinity is different than the configured affinity
KT - represents whether the process is a kernel thread
EXC - represents whether the process belongs to the process exception list (vsaffinity_exception.conf)

# fw ctl affinity -l -a -x -vsid 3
---------------------------------------------------------------------
|PID |VSID | CPU |SRC|V|KT |EXC| NAME
---------------------------------------------------------------------
| 14230 | 3 | 2 3 | V |*| | | snmpd_3
| 16579 | 3 | 2 3 | V |*| | | fwk_wd
| 16649 | 3 | 2 3 | V |*| | | fwk
| 18628 | 3 | 2 3 | V |*| | | cpd
| 18859 | 3 | 2 3 | V |*| | | fwd
| 21124 | 3 | 2 3 | V |*| | | cpviewd
| 21285 | 3 | 2 3 | V |*| | | mpdaemon
| 21293 | 3 | 2 3 | V |*| | | ci_http_server
| 24428 | 3 | 3 | V | | | | cphamcset
| 32591 | 3 | 2 | V |*| | | routed
---------------------------------------------------------------------
PID - represents the pid of the process
VSID - represents the virtual device id
CPU - represents the CPUs assigned to the specific process
SRC - represents the source configuration file of the process - (V)SID / (I)nstance / (P)rocess
V - represents validity,star means that the actual affinity is different than the configured affinity
KT - represents whether the process is a kernel thread
EXC - represents whether the process belongs to the process exception list (vsaffinity_exception.conf)
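For anyone who wants to check their own cluster quickly after a reboot, a rough loop like the one below (just a sketch; adjust the VSID list to your environment) greps each Virtual System's output for the * marker and reports which VSes are affected:

# flag every VS where the actual affinity differs from the configured one
for vsid in 1 2 3; do
    if fw ctl affinity -l -a -x -vsid $vsid | grep -qF '*'; then
        echo "VSID $vsid: MISMATCH (actual differs from configured)"
    else
        echo "VSID $vsid: OK"
    fi
done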
My blog: https://checkpoint.engineer
_Val_
Admin

Hi Rick Hoppe, that is more than enough.

Your issue is described in sk130432. You either need a higher Jumbo or a hotfix. Hope this helps. 

RickHoppe
Advisor

Thanks! Not sure why I did not find that SK myself. I've updated the service request at TAC and asked for the hotfix. Will let you know the results as soon as I have the hotfix.

My blog: https://checkpoint.engineer
_Val_
Admin

No worries. I had looked at this SK twice before, but could not figure out that it was related to your case until I saw the output.

Let me know how it goes.

Kaspars_Zibarts
Employee

Very, very peculiar! Thanks for sharing. I will have to check ours tomorrow. I got rather worried now, but I also know from our CPU graphs that all should be OK.

RickHoppe
Advisor

In related posts I saw an earlier post created by you: CoreXL gone rogue on VSX after Take 112. It looks like the same issue. And we both implemented the same workaround: re-running the affinity commands.
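Until there is a proper fix, that workaround can be wrapped in a tiny script to run after every reboot. The sketch below uses the CPU/VSID numbers from my Customer 2 example, so they obviously need adjusting per environment (and interface/SIM affinity may still have to be redone separately with the interactive sim affinity -s):

# re-apply the VS affinity after a reboot (example values only)
fw ctl affinity -s -d -vsid 0-1 -cpu 2
fw ctl affinity -s -d -vsid 2-3 -cpu 3

# verify that configured and actual affinity match again
fw ctl affinity -l -a
fw ctl affinity -l -a -x -vsid 2 | grep -F '*' || echo "no mismatch for VSID 2"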

I'm curious if you are able to replicate it with a reboot.

My blog: https://checkpoint.engineer
Kaspars_Zibarts
Employee

Yeah... the command output does not look good at all, exactly as you pointed out: the configured affinity does not match the actual one.

BUT: I haven't noticed anything in our CPU graphs! They show the profile matching the configured affinity, so technically all is running as expected. I would say that's why we haven't noticed it.

We are on Take 142. Actually, 3 out of 4 gateways report that the actual affinity does not match the configured one, and one seems OK. But I believe I messed around with that one and reset the affinity manually at some point, as we were testing Multi-Queue on one of the bonds.

I'll upgrade to jumbo 154 and see what happens then.

Valeri Loukine, could you check with R&D when the fix will be included in a Jumbo? Basically, a mismatch between the configured and actual affinity on VSX after a reboot.

_Val_
Admin

R&D and TAC are looking in the author's case already. Give us some time to get results, please

Kaspars_Zibarts
Employee

Awesome! Let us know if they want any logs / evidence!

RickHoppe
Advisor

Not sure if upgrading your environment to JHF Take 154 will solve it, as in our case both customers are already on JHF Take 154 or higher.

Customer 1 is on JHF Take 154.

Customer 2 is on JHF Take 167.

The output I posted came from Customer 2.

The case at TAC is still under investigation with R&D as they firmly believe sk130432 is not applicable.

My blog: https://checkpoint.engineer
Kaspars_Zibarts
Employee

I noticed that. What I meant to say is that I want to see what happens with the remaining VSX that still has its affinities in a good state after the Take 154 update / reboot. :)

_Val_
Admin

Rick, could you please send me your SR number in a private message?

RickHoppe
Advisor

Done.

My blog: https://checkpoint.engineer
Kaspars_Zibarts
Employee

Rick - did you get to the bottom of this?

RickHoppe
Advisor

It makes me really sad to say, but no.

The increased transparency, as promised with BEYOND, does not seem to apply to this case, as we only receive updates saying they are working on replicating the issue. No further details.

My blog: https://checkpoint.engineer
Kaspars_Zibarts
Employee

I just had it on one of my VSX clusters that previously survived reboots without noticeable impact. This time there was a weird increase in CPU usage on the 4 cores that were supposedly allocated to SecureXL and Multi-Queue. But resetting the VS0 affinities fixed the abnormal usage on those cores. So I can confirm we are affected by this too.

RickHoppe
Advisor

Did you also submit a case to TAC?

My blog: https://checkpoint.engineer
Kaspars_Zibarts
Employee

Nope, I was hoping yours was resolved so I could just get the answer from you. :) And now all the evidence is gone.

RickHoppe
Advisor

I still have one VSX cluster left that is not in production yet, so I can show R&D what happens. It is very easy to replicate by rebooting a node.

But apparently nobody is interested.

I’ll update this thread again when I have an answer. Hopefully this year though...

My blog: https://checkpoint.engineer
Sergio_Afonso_C
Participant

Hi Rick.

Did TAC give you a solution for your issue? We're facing the same one...

Thank you!
