cosmos
Collaborator

VSX changes not published, gateways out of sync with CMA

Hi CP fans...

Not sure where to post this as the "more" option in the community "Select Location" dropdown renders this page useless, another pink problem but minor in comparison to this.

Problem: New VS created on a VSX gateway cluster. The process succeeds, and the new VS exists on the cluster with InitialPolicy and trust established. However, the console did not publish the changes; we could see 21 unpublished changes in the console.

We go to manually publish the changes, and notice the publish dialog is pre-populated with the following details:

Session Name: VSX Configuration

Description: Published automatically

On publishing, the following console error is displayed: Publish Failed com.checkpoint.management.coresvc.ObjectNotFoundException: Object not found - [some-UUID]

The session was discarded, so the VS is configured on the gateway but is not in SmartConsole, because the discarded changes could NOT be published. Now I have a VS I cannot manage or install policy on, nor can I delete it, because it does not exist in the console.

I'm having to log support calls to do seemingly normal things with this product, it more than doubles the time we anticipate on these jobs.

Is there any 'normal' way to recover from this situation, e.g. manually remove the VS from the gateway or get the CMA to pull the config back? Or is it another half day open-heart surgery?

6 Replies
Kaspars_Zibarts
Authority

Option 1: roll back both management and all VSX cluster members to the backed-up state

Option 2: you "re-build" each cluster member one by one from current mgmt using the vsx_util reconfigure tool. It's a two-step process: first you remove all VSX config from the gateway with the reset_gw command, and then you run vsx_util reconfigure on the management to push out the "current" VSX config.

I realise it might sound drastic, but I have run it hundreds of times and it restores a rather "clean" state between mgmt and VSX

the_rock
Authority

I have not dealt with VSX in a long time, but correct me if I'm wrong... reset_gw, doesn't that wipe out all VSX config?

Kaspars_Zibarts
Authority

@the_rock - correct, it does. That's the whole point, you get a clean slate to apply config again from mgmt using vsx_util 🙂

the_rock
Authority

Right, I recall now... so it's not like a typical clean install, more of a reconfigure of the VSX config.

cosmos
Collaborator

I don't know whether to laugh or cry. This was supposed to be a simple task, and previous experience with the above process on HFA91 was a nightmare (I still have the grey hair logs from 6 months ago). We had to run reset_gw at least twice (consistently, both on R77.30 and R80.40 gateways), and from what I understand the process still leaves bits on the gateway, it knows it's VSX (i.e. vsx stat runs) so it's not exactly a fresh box. I've also had issues running vsx_util reconfigure following a reset_gw due to some leftover bits on the gateway under the context directory (/opt/Cpshrd-[version]/CTX/CTX000nn/stuff).

# reset_gw

Cleaning database  [Error]

Reset Gateway operation terminated

# reset_gw

Cleaning database  Error: There is a static or default route by name for interface wrp1

Error: Invalid MTU. Allowed MTU range is 68-16000

Error: libdb_do_transaction: connection closed during operation

Error: Couldn't connect to /tmp/xsets:  Connection refused

Removing VSX directories  [OK]

Reset Gateway operation finished successfully
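
For anyone scripting around this: the two runs above end differently ("terminated" versus "finished successfully"), yet the "successful" one still prints four errors. Below is a minimal Python sketch, assuming the reset_gw output has been captured to a string, that separates the final status from the errors seen along the way. The function name and the returned keys are made up for illustration and are not part of any Check Point tool.

```python
import re


def summarize_reset_gw(output: str) -> dict:
    """Summarize a captured reset_gw transcript (hypothetical helper, not a Check Point tool)."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    errors = []
    for ln in lines:
        # "Error: <text>" may follow a step label such as "Cleaning database".
        match = re.search(r"Error:\s*(.+)$", ln)
        if match:
            errors.append(match.group(1))
        elif ln.endswith("[Error]"):
            # A step that failed without detail, e.g. "Cleaning database  [Error]".
            errors.append(ln[: -len("[Error]")].strip() + " failed")
    finished = any("Reset Gateway operation finished successfully" in ln for ln in lines)
    terminated = any("Reset Gateway operation terminated" in ln for ln in lines)
    return {
        "finished": finished,
        "terminated": terminated,
        "errors": errors,
        # Assumption: a "successful" run that still printed errors deserves a manual check.
        "needs_review": finished and bool(errors),
    }
```

Fed the second transcript above, this would flag the run for review even though the last line reports success, which matches the leftover-bits concern described earlier.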

 

I have also rebuilt the environment in several labs (both physical and virtual) using vsx_util reconfigure several times, which was successful when the customer's MDS was running R77.30. Since the MDS upgrade to R80.40 the process consistently fails - either the final step of the operation times out (after an hour or so and many Virtual Systems), or when it succeeds the Virtual Systems on the gateway have No Trust regardless of all documented SIC reset processes (including sk168393). This is another ongoing issue that TAC or PS have not been able to resolve, they say it's hardware related and FWIW have sent us new 7k's to rebuild the lab. Same process, expecting a different result? There's a term for that.

This being the case, I'm not willing to run vsx_util reconfigure if it will result in a worse situation than we are in now.

Admittedly the customer is running HFA91, which was the latest at the time we upgraded the MDS and staged the new gateways. It's critical infrastructure under strict change control so hotfixes are not a simple process, but since I have a major migration next week to the new cluster (the one where I cannot delete the VS) I think we'll have to do an emergency change to upgrade management (4 multi-domain servers) and the gateways, before I even consider running reset_gw or vsx_util reconfigure.

I budgeted 2 hours to build the VS and run the tests, it will probably end up being several days coordinating changes with emergency CAB and rebuilding things.

Kaspars_Zibarts
Authority

I hear you, @cosmos! On occasion I have had strange issues with reset_gw, and normally running it a second time succeeded. It does leave VSX on; that's expected. It only removes the VSX config that's pushed from mgmt: VSes, the interfaces on them and the associated routes. The base config remains, i.e. your physical interfaces, and the mode is left as VSX.

How many VSes are we talking about, roughly? Ours is sub 50 and works (mostly) without any major issues (from R67 till now, R80.40 T120). What HW?

And I hear you about strict change control, same here. So often we end up in a catch-22 or chicken-and-egg situation: our CAB won't let us upgrade before the root cause is found, and Checkpoint won't do any investigation unless we upgrade.. hehe

One recent weird experience was with packet loss between VSX cluster members: vsx_util reconfigure kept failing (can't remember which step, but one of the last ones I believe), and it turned out packet loss was around 12% due to an oversubscribed trunk in the core routers.
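
A quick loss check between members before a long reconfigure would catch that kind of thing. Here is a rough Python sketch, assuming you have the summary line from a standard Linux (iputils) ping run between the members. The function names and the 1% threshold are purely illustrative assumptions, not a Check Point recommendation.

```python
import re


def packet_loss_percent(ping_summary: str) -> float:
    """Extract the loss figure from an iputils-style ping summary line."""
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", ping_summary)
    if match is None:
        raise ValueError("no packet loss figure found in ping output")
    return float(match.group(1))


def safe_to_reconfigure(ping_summary: str, max_loss_percent: float = 1.0) -> bool:
    """Return True when measured loss is at or below the (hypothetical) threshold."""
    return packet_loss_percent(ping_summary) <= max_loss_percent
```

With a summary reporting 12% loss, `safe_to_reconfigure` returns False, so the trunk problem would surface before the reconfigure is started rather than at one of its last steps.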

Fingers crossed that the newer HFA helps your case!
