cosmos
Collaborator

vsx_util change_interfaces

  1. You need to open the box
  2. The box is locked
  3. The key is inside the box
  4. Goto 1.

This is my experience with vsx_util change_interfaces in both R77.30 and R80.40.

Use case: Migrating VSX gateways from CP appliances to open servers, where source interfaces are "ethN-nn" and destination interfaces are "ethNN".

For various reasons (default Gaia partitioning & a lazy admin), the vsx_util change_interfaces process failed partway through and still needs to be completed. On the initial run, the "delete old interfaces" option was selected to ensure they are removed from the DB, as they don't exist on the target hardware. The process seems to continue OK and reconfigures all the virtual systems in all domains, until we invoke vsx_util reconfigure to reconfigure the new gateway.

The process fails, because eth1-01 is in the database and not on the gateway. Does vsx_util have amnesia? I'm sure I told it to remove them.

We attempt to remove the old interfaces from the object in the GUI, however the push process fails. Why? Because "Initial policy is installed on the VSX".

fw unloadlocal to the rescue? Nope... a new error this time: "Error retrieving policy installed on the VSX". I recall using dbedit on R77.30 to fix this a long time ago; however, dbedit crashed continuously, forcing us to restore the MDS from a backup to recover from this state.

12 Replies
_Val_
Admin

It is highly recommended to get both MGMT and GW backups before any configuration change on VSX.

In your case, you need to remove vs_slot from DB. Are you using GUIdbedit or dbedit? If former, try the latter. TAC may also give you a hand here...
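For reference, the inspection half of such a dbedit session might look like the sketch below. The domain and cluster object names are placeholders, and where exactly vs_slot lives in the object varies, so print the object first and involve TAC before modifying anything (after a fresh MDS backup):

```
mdsenv MyDomain                               # on an MDS, switch to the relevant domain first
dbedit -local                                 # connect to the local database
dbedit> print network_objects MyVSXCluster    # dump the cluster object and look for the stale slot
dbedit> quit
```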

T_E
Participant

In my experience, it is easier to export the VSX config with vsx_provisioning_tool, adapt the interfaces, and set up the new machine with the new config (imported via vsx_provisioning_tool) and a new cluster object. The change itself is very simple: shut down the interfaces on the old gateway, then activate the interfaces on the new gateway.

Magnus-Holmberg
Advisor

Doing it that way also allows you to run both clusters in parallel, which makes rollback much easier if needed, and even makes it possible to roll back a single VS (if you use link networks to reach the VS rather than a VR).

The only issue with this solution, as I see it, is that you may need to generate a lot of demo licenses, as a dmnvsx will not allow you to have two VSs within the same CMA.

Well worth it, though; this is how we have replaced 10+ physical VSX clusters over the last year.

https://www.youtube.com/c/MagnusHolmberg-NetSec
cosmos
Collaborator

You mean set vsx off, shut down the current cluster's interfaces and enable the new ones?

  1. Export current VS config (vsx_provisioning_tool -o show vd name blah)
  2. Adapt transactions for new interfaces (I use sed for this, and wrap the relevant transaction begin and end statements around the config)
  3. Build new VS (vsx_provisioning_tool -f <transaction>)
  4. Shutdown interfaces on new cluster (set vsx off, set interface bond0.blah state off, set interface wrpblah state off etc.)
  5. Install policy on new cluster
  6. Shutdown interfaces on old cluster
  7. Enable interfaces on new cluster
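Step 2 above can be scripted. As a minimal sketch (the VS name is an example, and GNU sed is assumed), export the VS, wrap it in a transaction, then replay it on the new management:

```shell
# Export one VS definition, then bracket it with transaction statements
# so the whole config is applied atomically on replay.
vsx_provisioning_tool -o show vd name VS1 > vs1.txt
sed -i '1i transaction begin' vs1.txt     # prepend opening statement (GNU sed)
echo 'transaction end' >> vs1.txt         # append closing statement
# After remapping interface names in vs1.txt, on the new management:
# vsx_provisioning_tool -f vs1.txt
```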

My original plan was to export the VS configs, adapt the interfaces, import them to the new cluster running in parallel, and stop short of installing policy. During the maintenance window, stop each VS on the old cluster while policy gets installed on the new VS (a new VS doesn't get real IPs until it gets a policy). It's not elegant, that's for sure.

To make matters worse, the MDS and MLMs sit behind the Mgmt interface of the current cluster and reach the outside world via VS0 --> wrp link --> another VS; this is also our path in from management clients. So the migration also involves vsx_util change_mgmt_subnet (which allows both clusters to exist during the transition) and moving vs0:wrp1's IP to the new cluster (via vsx_provisioning_tool). Fortunately, we will have direct access to that VLAN for the change.

I've resorted to sk86121 to rebuild the lab, which comes with its own personality (only the first 10 interfaces stick after reboot, on a box with 18 in use).

Magnus-Holmberg
Advisor

No, I feel like you're overcomplicating things.

I mean set it up in parallel, with VS0 connected to the mgmt server.
Install it like you would any new cluster: fixing bonds, patches and all that is needed.

Run vsx_provisioning_tool on the old box to get the configuration out.
Edit it in Notepad++ so you get the correct VS names and interface numbers (the VS names need to differ from the existing ones).
The main IPs also need to differ so you don't destroy any VPN tunnels.

Then run vsx_provisioning_tool on the new cluster and add all the VSs.
But don't allow the VLANs on the bond interfaces yet, so production traffic stays on the old one.
Push policy to the new VSs to see that all looks good.

As they have different VS0s, these clusters should be able to run in parallel.
And if needed, use demo licenses until you have finished emptying the old cluster.

When you are ready, you remove the VLANs on the switches from the old VSX cluster and open them to the new VSX cluster.
You will need to flip the main IPs and such before the maintenance window; just make sure no one else is playing around in the mgmt server at that time.
Doing it this way allows you to prep everything in advance and also roll back very easily.
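As an illustration, the adapted transaction might look something like this; every name, IP and interface below is a made-up example, and the exact grammar varies by version, so compare against your own "show vd" export before replaying anything:

```
transaction begin
add vd name VS1_new vsx NewVSXCluster
add interface vd VS1_new name bond1.100 ip 192.0.2.1/24
set vd name VS1_new main_ip 192.0.2.1
transaction end
```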

cosmos
Collaborator

Thanks Magnus. One cluster connects to several different switch fabrics all of which are managed by a separate team. The main fabric is Cisco ACI, managed via the APIC directly - while I would prefer to use the ACI API and prepare port profiles to add/remove VLANs from the network fabric (or modify the VLAN ACL from a non-ACI switch) unfortunately it's not that simple and we have to lock in our add/remove VLAN changes days in advance with that team.

Hence my desire to manage the interface state from the gateway. We've successfully done this on non-VSX gateways managed in-band (i.e. via a sub-interface on a bond) and were hoping to apply the same principle here; however, it's not supported in VSX without either "set vsx off" (which I will most likely need) or stopping/starting individual Virtual Systems. I think my plan above is as simple as it can get, though it will probably be best to shut down the new bond rather than each individual interface on the new cluster. The old cluster will still need 14 individual interfaces shut down.

Due to the inherited, unnecessarily complicated path used to access the manager (via a Citrix published app behind another VS), we will also need to flip the VS0 management IP to the new cluster, remove the default route, and add a bunch of statics via that management VS. I have vsx_provisioning_tool transactions prepared for this and they work well - at least they did on the R80 build I tested 6 months ago; I have since found the provisioning tool output has changed, just to keep us on our toes.

I use sed (the Linux stream editor) instead of NP++/VS Code because we have 50+ VSs and several hundred interfaces to re-map (including access ports moving to a VLAN on the new port channel); with that number of lines, automation is key to accuracy.
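For illustration, one way to drive the re-map from a table rather than hand-written expressions; the file names and interface pairs are examples, and GNU sed's \b word boundary is assumed:

```shell
# ifmap.txt holds "old new" pairs, one per line, e.g.:
#   eth1-01 eth11
#   eth1-02 eth12
# Turn the table into a sed script of whole-word substitutions,
# then apply it to the exported VS transaction in one pass.
awk '{ printf "s/\\b%s\\b/%s/g\n", $1, $2 }' ifmap.txt > rename.sed
sed -f rename.sed vs_export.txt > vs_import.txt
```

The \b anchors keep shorter names from also matching inside longer ones.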

I find these jobs seem simple on the surface, but once you get into the detail nothing works as expected (R80 change_interfaces script regressions from R77, bugs in fw vsx sicreset, an SK to resolve another SK that doesn't work in R80, renamed interfaces disappearing, output from Gaia show configuration not usable as input, being unable to restore the MDS because the source machine has _AUTOUPDATE hotfixes, and the reconfigure script failing due to a bad copy of appi_db.C after reset_gw, forcing us to find a good copy from a working VSX member of the exact same FW and HF version...).

cosmos
Collaborator

Thanks @_Val_  I might give that a go. We have backups of all shapes and sizes, I'm considering buying shares in storage companies 🙂

@T_E I totally agree - we're simulating current state in a lab for an extremely sensitive environment with a large number of virtual systems. We needed a near-identical replica to establish a series of baselines matching current state, and successfully built it on R77.30. Management and gateway upgrades are simulated in the lab; scripts automate and enrich provisioning tool transcripts and move trunks and untagged interfaces to bundles, while we manually handle anti-spoofing groups with exclusions, stuff like that. It may sound like a lot of effort, but it's necessary to manage the risk.

While extremely finicky, storage-hungry and time-consuming, this process worked great until we upgraded the production management to R80.40 and needed to update/rebuild the lab. Changes to vsx_util have resulted in some interfaces missing from the change_interfaces operation.

I think renaming the interfaces per sk86121 is a better approach, at least to replicate current state. Target state uses bundles, as all good states should 🙂

cosmos
Collaborator

We also built run sheets for the changes, test plans, rollback plans, timed the operations, established comms plans, DR plans, all the good stuff. Migrating all virtual systems for one cluster, with maximum resources applied to the MDS, takes... hold on to your undies... up to 12 hours! And because it's VSX, that's several outages for flows that touch multiple VSs. I don't know many people with that amount of patience for the colour pink haha

cosmos
Collaborator

Turns out the unlisted interfaces were still in the object after change_interfaces, and the reconfigure failed.

We ended up changing the interface names per sk86121; however, the change only survived one reboot: after the first boot the gateway had the new interface names, but we found the udev rules file got overwritten on the next reboot. The fix? chmod -w /etc/udev/rules.d/00-OS-XX.rules. I'm sure it's another bug, but alas, I have bigger fish to fry.

Bob_Zimmerman
Advisor

Worth noting for the future: bonds avoid this issue. If the application config references only bonds, you can easily change at the CLI level which physical interfaces are members of which bond. A bond can be made up of one member interface, so you lose nothing by using them, and you gain flexibility to move things around.
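In Gaia clish, a single-member bond that the VSX config can reference instead of the raw NIC might be set up like this (the group number and interface name are examples; check the syntax against your version):

```
add bonding group 1
add bonding group 1 interface eth1-01
set interface bond1 state on
save config
```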

cosmos
Collaborator

Absolutely, target state uses bundles (LACP LAGs with vPC on the switch end), as all good designs should 🙂

Also worth noting: if using Cisco ACI L3-outs over vPC, the SVI needs five IPs for a cluster spanning multiple DCs. We've had to expand some /29s to accommodate this for non-VSX clusters, since 5 (for the vPC) + 3 (for the cluster) = 8 addresses, more than the 6 usable hosts a /29 provides. It's not a problem for VSX, where the members use the internal communication network.

Bob_Zimmerman
Advisor

Non-VSX firewalls can actually do the same off-net member IP thing which VSX does. The members need to be on the same network as each other, but you only need one address on the functional network for the VIP.
