(R77.30) coreXL changes with minimal disruption

Ville_Laitinen · ‎2018-10-13

I'm in a situation where the SNDs need more CPUs and the number of CoreXL instances needs to be reduced. Easy, yes?

Yes, unless you want to minimise the disruption...

HA considers the node with more CoreXL instances 'upgraded', hence when the other node is rebooted after reducing CoreXL instances it will go straight to active without bothering with the annoying session sync thingy.

Is there a reasonably safe way to at least get some connections to persist without turning accept_non_syn_tcp on ?

The best I could come up with is:

- Force fwha_version on the first node to be modified to a value higher than on the unmodified node. Version upgrades are always considered first, so this will force the changed node to go into 'ready' state instead of active after reboot.

- Run fw fcu (not at all supported according to the docs, but it seemed to mostly work in the lab) on the 'ready' node before rebooting the other node to activate its CoreXL changes... remembering to change fwha_version back to what it should be.

Constantin_Pop · ‎2018-10-14

Hi Ville,

I remember trying something similar, reducing the number of fw instances with minimal traffic disruption, but without much luck. You could try temporarily disabling the Stateful Inspection setting that drops out-of-state packets, or enabling the GW to send a RST for those connections, but you might be better off just getting a short maintenance window.

I would also recommend upgrading to R80.10 with the latest Jumbo HF GA before changing the SND/FW workers core distribution. I see better performance with it than with R77.30.

I assume you already checked sk98348:

It is recommended to allocate an additional CPU core to the SND only if all of the following conditions are met:
- Your platform has at least 8 CPU cores.
- The 'idle' value (run 'top' command and press 1 to display all CPU cores) for the CPU core currently running the SND is in the 0%-5% range.
- The sum of the 'idle' values (run the 'top' command and press 1 to display all CPU cores) for the CPU cores running CoreXL FW instances is significantly higher than 100%.
If any of the above conditions are not met, the default configuration of one processing core allocated to the SND is sufficient, and no further configuration is necessary.
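
As a rough illustration, the three sk98348 conditions quoted above can be written as a single check. This is only a sketch: the function name, the parameter names, and the 150% cutoff standing in for "significantly higher than 100%" are assumptions, not values from the SK. The idle percentages are expected to be read off 'top' by hand.

```python
def extra_snd_core_recommended(total_cores, snd_idle_pct, worker_idle_pcts,
                               worker_idle_sum_threshold=150.0):
    """Return True only if all three quoted sk98348 conditions hold.

    worker_idle_sum_threshold is an assumed cutoff; the SK only says the sum
    must be "significantly higher than 100%".
    """
    # Condition 1: the platform has at least 8 CPU cores.
    enough_cores = total_cores >= 8
    # Condition 2: the core running the SND is nearly out of idle time (0%-5%).
    snd_saturated = 0.0 <= snd_idle_pct <= 5.0
    # Condition 3: the CoreXL FW instance cores have idle time to spare in total.
    workers_have_headroom = sum(worker_idle_pcts) > worker_idle_sum_threshold
    return enough_cores and snd_saturated and workers_have_headroom


if __name__ == "__main__":
    # Made-up readings for an 8-core box: 1 SND core and 7 FW instance cores.
    print(extra_snd_core_recommended(8, 3.0, [40, 45, 50, 35, 30, 40, 45]))
```

If any one of the three checks fails, the function returns False, which matches the SK's wording that the default single SND core is then sufficient.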

Timothy_Hall · ‎2018-10-14

ClusterXL members with a different number of kernel instances (firewall workers) allocated will not be able to sync with each other while their configurations are mismatched. When the initially-modified member reboots and comes back up it is supposed to go into a "Ready" state, but I've seen some odd things happen with traffic anyway. As Constantin Pop mentioned, unchecking "Drop out of state TCP packets" under the Global Properties for Stateful Inspection (and reinstalling policy) ahead of time is advised. This will blunt the impact of a non-synced failover by allowing previous connections to continue through the newly-active cluster member even though its state table is initially empty. Once the second member has been modified and rebooted, it will come back up, take a full sync, then rejoin the cluster normally. Just don't forget to recheck the box (and reinstall policy) when you are done!

Also, if you have Multi-Queue enabled on any interfaces, you will need to run cpmq reconfigure on each member after both members have had their kernel instance number updated, and reboot them individually. So you must reconfigure CoreXL, reboot, run cpmq reconfigure, then reboot again on all members. You can *not* avoid doing two reboots in this case; don't try to shortcut this procedure or you will regret it.
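
The ordering described above can be sketched as a small checklist generator. This is one reading of the procedure, not an official runbook: the function name and the member names are hypothetical, and cpconfig is named only as the usual place to change the CoreXL instance count. The script prints the steps in order and does not run anything on a gateway.

```python
def corexl_change_plan(members, multi_queue_enabled):
    """Return the ordered steps for changing CoreXL instances on a cluster."""
    steps = []
    # Phase 1: change the instance count and reboot, one member at a time.
    for member in members:
        steps.append(f"{member}: change the CoreXL instance count (cpconfig)")
        steps.append(f"{member}: reboot")
    # Phase 2: only once every member has the new instance count, and only
    # if Multi-Queue is enabled, reconfigure it and reboot each member again.
    if multi_queue_enabled:
        for member in members:
            steps.append(f"{member}: run cpmq reconfigure")
            steps.append(f"{member}: reboot")
    return steps


if __name__ == "__main__":
    # "fw1" and "fw2" are placeholder member names.
    for number, step in enumerate(corexl_change_plan(["fw1", "fw2"], True), 1):
        print(f"{number}. {step}")
```

With Multi-Queue enabled, every member appears twice in the reboot steps, which is the two-reboot requirement stated above.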


Ville_Laitinen · ‎2018-10-14

Thanks for the helpful replies.

I did realize I made a wrong assumption about the manually forced sync. While it worked in the lab, that was likely because all traffic was being accelerated, and trying this in production could end up with a crash, worst case... and an unknown state at best.

Turning off stateful inspection seems like the only way to approach this (I omitted it from the initial post because I wanted to hear if there were other choices).

However, even with stateful inspection turned off, the first failover will still be somewhat uncontrolled: the modified node in this case has fewer workers, so it will boot straight to active unless it is somehow forced to be down (or into ready state with the HA version change).
