VSX cluster questions: Failover & updatable objects

Scottc98 · ‎2023-06-13

Hello,

I've been setting up a lab to get myself more hands-on experience with VSX Clusters and VSLS (R81.10 Take 95).

It took a moment to get all of the blade updates working, and after reading through sk106496 (Software Blades updates on VSX - FAQ), I do see that the majority of the blade updates are 'proxied' in some way via the VSX gateways. The same SK mentions that, for updatable objects, the VS itself needs direct access to the internet for updates.

I got some initial errors on my threat policy install, since some exceptions I was already using referenced these objects.

I did take a look at sk121877 (Package of Updatable Objects is missing on the Security Gateway) and did notice that the files were missing within the VS. After getting internet access resolved, it did indeed update and no more errors.

But.....the issue I have is: How do you resolve this on the Standby VS? I still get policy install errors on my 2nd VSX node since it doesn't have any updates, and since only one VS is technically active, how do you ensure that the updates are present in the event you need to fail over? Is there any synchronization of these updates between the active/standby VS members? I'm tempted to shut down one member here and see if it eventually updates on the 2nd box, but it seems like I am missing something here.

Regarding cluster failovers, failing over the main box (VS0) does feel very similar to a typical ClusterXL setup. It felt a little weird to fail over VS0 and then see the single VS I had (VS1) still running as active on the original member. But....I do understand from reading further SKs (sk95133) that this is normal and that you can fail over an individual VS. So it's kinda cool to jump into each virtual system and fail over one by one.

Now...my question 🙂 I am used to a non-VSX cluster where I can set the priority of the cluster members, allowing me to fail over from member A to member B....AND keep member B as the active node when member A comes back up.

What I noticed with both VS0 and VS1 is that they both go back to member A when that node is up again. Is this by design, and is it configurable? I can see why the VSs themselves might do this to balance load in a VSLS setup, but I am curious about the sticky priority of VS0. I am trying to understand more about the typical patching situation where you would want to gracefully move everything active off one chassis so you can patch/reboot it.

I see sk56060 as a reference point and it does note that "If there are only two physical VSX members, then the simplest way would be to run the clusterXL_admin down command (refer to sk55081: Best Practices - Manual fail-over in ClusterXL) on the VSX cluster member, from which we move the instances of Virtual Systems in Active state." If I do this on VS0 and VS1 here to isolate a node (member A, for example), wouldn't that member just come back as "active" after reboot? Or do you use the '-p' option flag so the member stays down after reboot?

Is the "vsx_util redistribute_vsls" method they mention then the best way to go each time: move everything to one cluster member, then move it to the other node, and finally revert to normal at the end to 'balance' back?

The main situation here is upgrades and JHF patching, so if there is an SK that covers that, I'll take a look 😉

I know that's a lot, so I appreciate any time taken to review this and lead me in the right direction 😉

PhoneBoy · ‎2023-06-16

These updates are not synced between cluster members.
Each member is expected to download updates for Updatable Objects and the like on its own.
You may need to adjust the configuration to allow this.
I believe this applies to VSX as well, but it might be worth checking with TAC: https://support.checkpoint.com/results/sk/sk43807

On the other questions you have, I defer to those who are a bit more intimate with VSX than I am 🙂

Scottc98 · ‎2023-06-21

Thanks again @PhoneBoy. I was thinking about 'no_hide_nat' as a workaround I have seen on previous non-VSX clusters, but couldn't find that SK (bookmarked now ;)).

I did go through all of the steps and tried each one to no avail. I did have to go with the 'no_hide_nat' from the management server side and that seemed to fix my problem.

One question I have: Are the other 3 items in the SK required when doing the 'no_hide_nat' change? My thought was about how this would affect other clusters in a prod environment if you had to proceed this way. The management change is global and would therefore be pushed to other clusters on the same code train at the next policy install.

PhoneBoy · ‎2023-06-21

You mean adding the services as "sync on cluster"?
Don't believe this is strictly required.

Scottc98 · ‎2023-06-21

Referring to step 3 and "fwha_forw_packet_to_not_active"

The 'sync on cluster' in step 2 was set as defaults.

The modification in step 4 on the table.def is global to my understanding. So if it fixes one cluster, how does it affect the others? And is setting 'fwha_forw_packet_to_not_active=1' a requirement for the others to have to consider?

PhoneBoy · ‎2023-06-22

Step 4 explicitly says "If the above steps do not resolve the issue", which means it shouldn't be necessary.
However, it seems like Step 3 (setting fwha_forw_packet_to_not_active) may be required.
Having said that, it's currently working without doing either, correct?

JozkoMrkvicka · ‎2023-06-21

Regarding the cluster failover part... yes, "vsx_util redistribute_vsls" is the best option for controlling where the VSs are active. You can set all VSs to be active on member 1 and do things (upgrade, install hotfix, RMA,...) on member 2. Once that work is done, with all VSs on member 2 still standby, just use "vsx_util redistribute_vsls" from management to make all VSs active on member 2, and then work on member 1. Once you are good, at the end, redistribute the VSLS load equally between both members.
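The order of operations above can be sketched as a toy model. This is only an illustration of the sequence, not Check Point tooling: the function names, member names, and the round-robin "rebalance" are hypothetical, and in practice each placement change would be made with "vsx_util redistribute_vsls" on the management server.

```python
def rebalance(vs_names, members):
    """Spread VSs round-robin across members (a stand-in for an even VSLS split)."""
    return {vs: members[i % len(members)] for i, vs in enumerate(vs_names)}


def rolling_patch_plan(vs_names, member_1, member_2):
    """Return (action, member, placement) steps for a two-member cluster.

    `placement` maps each VS to the member where it is active during that step.
    """
    steps = []
    # 1. Make every VS active on member 1, so member 2 only holds standby VSs.
    placement = {vs: member_1 for vs in vs_names}
    steps.append(("patch", member_2, placement))
    # 2. Flip everything to the freshly patched member 2, then patch member 1.
    placement = {vs: member_2 for vs in vs_names}
    steps.append(("patch", member_1, placement))
    # 3. Return to a balanced distribution once both members are done.
    steps.append(("rebalance", None, rebalance(vs_names, [member_1, member_2])))
    return steps


if __name__ == "__main__":
    plan = rolling_patch_plan(["VS1", "VS2"], "member_1", "member_2")
    for action, member, placement in plan:
        print(action, member, placement)
```

The point of the model is the invariant: the member being patched never hosts an active VS during its own patch step.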

For VS0 "automatic" failover, this is controlled by selecting the proper HA mode. There are 2 possible modes:

Active-UP

Primary-UP

You can check which mode your VSX is using by running "cphaprob stat" from VS0.

Active-UP means VS0 will maintain the current member as active, even when the other member comes back up and is standby. Automatic failback will NOT happen in this mode.

Primary-UP means VS0 will always prefer the higher-priority member. Once that member is back up, an automatic failover will occur to make it the active member again.

To change the HA mode between Active-UP and Primary-UP, you have to use the "vsx_util vsls" utility on management and select the proper option.
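To make the difference between the two recovery behaviors concrete, here is a minimal toy model, not Check Point code. The class, the `fail_back` flag, and the member names are hypothetical; `fail_back=True` stands for the behavior that returns to the higher-priority member when it recovers, and `fail_back=False` for the behavior that keeps whichever member is currently active.

```python
class ToyCluster:
    """Two-state HA model: members are listed from highest to lowest priority."""

    def __init__(self, members, fail_back):
        self.members = list(members)
        self.fail_back = fail_back      # True: return to higher-priority member
        self.up = set(self.members)     # members currently healthy
        self.active = self.members[0]   # highest priority starts as active

    def _elect(self):
        candidates = [m for m in self.members if m in self.up]
        if not candidates:
            self.active = None          # nobody left to take over
        elif self.active not in self.up:
            self.active = candidates[0] # forced failover: active member is down
        elif self.fail_back:
            self.active = candidates[0] # voluntary failback to highest priority
        # otherwise: keep the current active member

    def member_down(self, name):
        self.up.discard(name)
        self._elect()

    def member_up(self, name):
        self.up.add(name)
        self._elect()


def simulate(fail_back):
    """Record who is active initially, after A goes down, and after A returns."""
    cluster = ToyCluster(["member_A", "member_B"], fail_back)
    history = [cluster.active]
    cluster.member_down("member_A")
    history.append(cluster.active)
    cluster.member_up("member_A")
    history.append(cluster.active)
    return history


if __name__ == "__main__":
    print("fail back:   ", simulate(True))
    print("keep current:", simulate(False))
```

With `fail_back=True` the model reproduces what was observed in the question (everything returns to member A); with `fail_back=False` member B stays active after member A recovers.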

Kind regards,
Jozko Mrkvicka
