Wolfgang
Authority

Maestro VSX VSLS (dual site), behaviour if WAN link is broken

How does a dual-site Maestro environment behave if the WAN link between the sites is lost?

Running VSX on Maestro in VSLS mode allows virtual system A to be active on site 1 and virtual system B to be active on site 2. If the connection between the two sites is lost, both virtual systems stay active on their respective sites.

Is it possible to stop or start the virtual systems on only one site, and is the management server needed for that?

What happens if the management server can reach only one site in this case?

Does anyone have experience with this?

10 Replies
Chris_Atkinson
Employee

 

Hopefully situations where multiple diverse fibre paths fail concurrently are rare, or you have designed accordingly.

Note that I have not personally tried the following process on Maestro, nor does the SK recommend it for anything other than restarting a VS.

sk169472: How to restart a specific VSX Virtual System in R80.30 and higher

CCSM R77/R80/ELITE
Wolfgang
Authority

@Chris_Atkinson 

This question is more about a disaster than a single failed link. With the loss of all links we end up in a split-brain situation, which must be resolved. That's why I'm asking. The other case: one of the datacenters goes down and not all of the virtual systems fail over to the surviving site. Would sk169472 be the supported way to bring an offline VS up?

Chris_Atkinson
Employee

I was discussing this with a colleague earlier, which prompted me to revisit this thread.

Most likely you would be forcing the system down on the least preferred site.

The expectation is that the standby would attempt to become active upon sync loss (assuming subsequent discovery via the uplinks fails).

@Lari_Luoma Do you have any additional knowledge to share here?

CCSM R77/R80/ELITE
Bob_Zimmerman
Authority


@Chris_Atkinson wrote:

Hopefully situations where multiple diverse fibre paths fail concurrently are rare, or you have designed accordingly.


In my experience, telcos lie all the time about the physical paths they use for links you pay for. My current company pays for WAN circuits from several telcos which ostensibly all take physically diverse paths leaving from different ends of the datacenters and going in different directions. I've lost count of how many times we've found out all these fibers actually go through a single piece of conduit when a backhoe takes it out.

Twice, we've found that one telco was reselling service from the other telco, and all four of the "physically diverse, carrier diverse" WAN circuits out from one of our datacenters terminated on the same line card of a single telco-side router. Twice.

It's great to tell the telcos that these circuits are meant to be redundant and must not depend on any single device on the telco side. It's still a good idea to plan for what to do if all the WAN links fail at the same time.

Chris_Atkinson
Employee

Agreed, autonomous datacenters exist as a concept for a reason.

CCSM R77/R80/ELITE
Dario_Perez
Employee

If you lose a link on the site where the VS is active, the VS fails over to the other site.

On the 3.10 kernel only, you can start/stop a VS with cpstop/cpstart in the VS context. You can also use clusterXL_admin down in the VS context.

 

The management server communicates with the SMO.

You don't need the management server to change the priority from site 1 to site 2 per VS.

It is a security group configuration:

set chassis high-availability mode 3  -> VSLS

set chassis vsls system primary_chassis 0/1/2  -> default / site 1 / site 2

set chassis high-availability vs chassis priority  -> weight per VS
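
As a rough illustration of the per-VS stop mentioned above, here is a small Python sketch that only builds the command lines for a list of VS IDs and prints them; it does not execute anything. The helper name `vs_stop_commands` is hypothetical, and using `vsenv <id>` to enter the VS context is my assumption, not something stated in this thread. How the lines would be run across the SGMs of a site is left open.

```python
def vs_stop_commands(vs_ids, method="clusterXL_admin down"):
    """Build (not run) the command lines to take each VS down in its own context.

    Assumption: `vsenv <id>` switches to the VS context before the stop command.
    """
    allowed = {"clusterXL_admin down", "cpstop"}
    if method not in allowed:
        raise ValueError(f"unsupported method: {method!r}")
    commands = []
    for vs_id in vs_ids:
        commands.append(f"vsenv {vs_id}")  # enter the VS context (assumed)
        commands.append(method)            # stop command named in the post above
    return commands


if __name__ == "__main__":
    # Print a review list for VS 1 and VS 2; nothing is executed.
    for line in vs_stop_commands([1, 2]):
        print(line)
```

Printing the list first makes it easy to review before anyone types the commands on a live security group.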

Marco32
Contributor

Hi Dario, please help me better understand these points:

 

> set chassis high-availability mode 3 ->vsls

This way I can use SGMs on both sites for my security group and have some VSs active on site 1 and others active on site 2. This command is executed on VS 0.

 

> set chassis vsls system primary_chassis 0

This way the VSs get their primary chassis on site 1 or site 2 based on their ID. Does this mean that by default a VS runs on its primary chassis? This command is executed on VS 0.

 

> set chassis high-availability vs chassis priority "1 2"

This command (executed in the VS) says that the VS runs on chassis 1 first (for example) and, if that fails, on chassis 2.

 

If I need to move a VS running on chassis 1 to chassis 2, I have to use

> set chassis high-availability vs chassis priority "2 1"

 

Regards

M.

Lari_Luoma
Ambassador

Let me also add that if you lose sync between the sites in a dual-site Maestro deployment, the system goes into the so-called SYNC_LOST state and the sites freeze their status, so there won't be a split brain. They continue to forward traffic but don't sync (because the sync link is down).

Wolfgang
Authority

Thanks @Lari_Luoma for the detailed description.

Just to clarify and make sure I understand correctly: if we are in the SYNC_LOST state and the sites freeze their status, VS1 stays active on site 1 and VS2 stays active on site 2, with the standby VSs the other way round. Now we want to take site 1 offline and make site 2 active.

Can all VSs on site 2 be brought to the active state with cpstart in the context of each VS?

Can all VSs on site 1 be stopped with cpstop?

Lari_Luoma
Ambassador

@Wolfgang After discussing with my colleagues and checking some documents, I'm editing this answer with updates.

In VSX, SYNC_LOST messages are sent via all interfaces in VS0. In practice this means the management interface. So, for the lost-sync mechanism to work correctly, the management interface needs to be connected on both sites (documented limitation). Sites freeze their status for as long as they receive SYNC_LOST messages from the other site.
If your sync is totally lost and you want to make sure that all traffic flows only via the current primary site, the best way to achieve this, in my opinion, is to cpstop or shut down the secondary site (not individual VSs). However, make sure you don't start it again until the sync is back up.
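
To make the freeze behaviour easier to picture, here is a toy Python model. It is only a sketch of the logic described in this thread, not Check Point's implementation: the names `Site`, `Role`, `on_sync_down` and `active_sites` are all made up for illustration. A site that still hears SYNC_LOST messages from its peer freezes its roles, while a site that hears nothing assumes the peer is gone and tries to go active, which is where split brain can appear if the management interface is not connected on both sites.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    ACTIVE = "ACTIVE"
    STANDBY = "STANDBY"


@dataclass
class Site:
    name: str
    roles: dict          # VS id -> Role on this site (hypothetical layout)
    frozen: bool = False

    def on_sync_down(self, hears_sync_lost: bool) -> None:
        """React to losing the sync link to the other site."""
        if hears_sync_lost:
            # Peer is still reachable via VS0 interfaces: keep current roles.
            self.frozen = True
            return
        # No SYNC_LOST messages at all: assume the peer is down and take over.
        self.frozen = False
        for vs_id in self.roles:
            self.roles[vs_id] = Role.ACTIVE


def active_sites(vs_id, sites):
    """Names of the sites on which the given VS is currently active."""
    return [s.name for s in sites if s.roles[vs_id] is Role.ACTIVE]


def make_sites():
    """VSLS layout from the thread: VS1 active on site 1, VS2 on site 2."""
    site1 = Site("site1", {1: Role.ACTIVE, 2: Role.STANDBY})
    site2 = Site("site2", {1: Role.STANDBY, 2: Role.ACTIVE})
    return site1, site2


if __name__ == "__main__":
    s1, s2 = make_sites()
    s1.on_sync_down(hears_sync_lost=True)
    s2.on_sync_down(hears_sync_lost=True)
    print("frozen:", active_sites(1, [s1, s2]), active_sites(2, [s1, s2]))

    a, b = make_sites()
    a.on_sync_down(hears_sync_lost=False)
    b.on_sync_down(hears_sync_lost=False)
    print("no SYNC_LOST heard:", active_sites(1, [a, b]), active_sites(2, [a, b]))
```

In the frozen case each VS stays active on exactly one site. In the second case both sites claim every VS, which is the situation that stopping the secondary site is meant to avoid.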
