R82 member unable to join cluster

marcyn · ‎2024-08-07

Hello,

I'm playing with ElasticXL in R82, which is great, but it looks like I have encountered a little "bug", or better to say, a "specific situation that probably was not taken into consideration by R&D".

I have 2 appliances and I want to check a couple of ElasticXL scenarios with them:
1) both in one Site (LoadBalancing) = works flawlessly
2) 1st in Site1, 2nd in Site2 (HighAvailability) = works flawlessly
And these two scenarios are quite obvious ... because of that there should be no issue ... and there is not, great !

However ... I wouldn't be me if I did not check a 3rd scenario...
3) 1st in Site1, 2nd in Site2 ... but using a different interface (fibre) for sync instead of the Sync interface (copper).

Why such a scenario?
Let's consider that we have 2 DCs separated geographically.
In DC#1 I want to have (up to 3) appliances that will act as LB, and in DC#2 I want to have (up to 3) appliances that will act also as LB ... but will be part of the same cluster.
In this scenario DC#1 (Site 1) will be active, and DC#2 (Site 2) will be standby.
This is, in my opinion, a really revolutionary change that we have thanks to ElasticXL - one cluster that is LB and HA at the same time!

This 3rd scenario will not work with the regular Sync interface (copper, limited to 100 m), so here is what I did:
a) ElasticXL cluster was created (1st appliance)
b) I added eth1-01 (fiber) to the Sync bonding group, and changed the configuration of this group so that eth1-01 is the primary
c) I powered up appliance2 (factory configuration)
d) because of the factory configuration there is no bonding group, etc ... so I created a bonding group with the eth1-01 and Sync interfaces
e) I addressed this "bond1" interface as 192.0.2.254 ... so that it could be discovered by exl_detectiond
f) after that I connected the two appliances via eth1-01 using a fibre cable
g) appliance1 discovered appliance2 and marked it as an available member with state "request_to_join"

So far ... so good ... but ...

h) after I executed "add cluster member ...... site 2" in gclish, bad things started to happen
i) the state changed to "joining_cluster" and that's it ... nothing more ... it got stuck

After a couple of minutes of waiting I began troubleshooting and discovered the problem on appliance2:

The Sync interface (copper) was addressed as 192.0.2.15 ... not bond1.
Because of that the process cannot finish, as the copper interface has no link.

Question:
Is it possible to use this 3rd scenario when NOT using copper as Sync?

For sure it will work if I have all of the appliances in the same DC, connect them via Sync (copper), create a cluster (both Sites), then add the fiber interfaces to the sync bonding group ... and only after that move the appliances that should be in Site 2 (Standby) to DC#2.

I think you see what's the problem here 🙂

Just wondering if it is possible to do it when the appliances are already separated?

Best
m.

emmap · ‎2024-08-07

Typically we would expect the Sync interfaces to run through a switching layer, as you can't directly connect more than 2 cluster members together and EXL is designed for more than 2 cluster members. Hence the available onboard sync interfaces are fine as the inter-DC links are not directly connected to the appliances.

marcyn · ‎2024-08-07

Yes, I agree that typically it will work as you mentioned (a couple of appliances and a switch - sync via copper port).

In my 3rd scenario the on-board Sync interface(s) will not be fine in all cases (lower models have only a copper Sync interface, only higher ones have fiber).
If the appliances are geographically separated (for example by a couple of kilometers or more), copper will not be enough ... and here a fiber connection will be necessary.
Hence for models with copper Sync we need to add fibre interfaces to the sync bonding group and use them for sync between appliances.

I'm thinking about such a scenario:
DC#1 = 1-3 appliance (active)
DC#2 = 1-3 appliance (standby)
Here I can use direct fiber connection.... or use fiber switch as well.

I'm just thinking about using ElasticXL as LB+HA at the same time...
LB in each Site, HA between two Sites (two DCs).

BTW
In my lab I have a 23500 model, which has a copper Sync interface.
I'm trying to "move it" to fiber.
To my surprise, after I added a fiber interface to the sync bonding group, connected the fiber cable, and then disconnected Sync (copper) ... it doesn't work ... I will play with this a little bit more.

ShaiF · ‎2024-08-07

Hi Marcyn,

Have you fetched topology and install policy after adding the slave before disconnecting eth1-Sync?

Regards,

Shai.

marcyn · ‎2024-08-07

Ah yes ... I could have forgotten about that ... I made so many changes that it is possible (and probably this is the case).
I will try that and see if it helps - it should 🙂

But still my question remains unanswered - is this 3rd scenario possible without using the on-board Sync interface (in my case copper)? Or is the on-board Sync interface "necessary" during "first-sync"?
Of course, for higher/newer models with a fiber Sync interface ... there will be no issue with it.

ShaiF · ‎2024-08-07

It will be possible once we fix the flow. You will create the bond, put 192.0.2.254 on it, and it will join without the physical Sync interface needing to be connected.

marcyn · ‎2024-08-08

Hi @ShaiF,

Great !
So it should be as I expected.

One more thing, regarding your comment:
"Have you fetched topology and install policy after adding the slave before disconnecting eth1-Sync?"

I just recreated the EXL cluster in my lab and unfortunately I had no success with this fiber connection for sync.
Here is what I did:
1) 1st appliance was prepared, sic + policy installed
2) 2nd appliance was added into EXL cluster by using "add cluster...."
3) after 2nd appliance returned everything was fine

4) now I added eth1-01 (fiber) to sync bonding group with:
[Global] R82-s01-01> add bonding group 1024 interface eth1-01

[Global] R82-s01-01> show bonding group 1024
1_01:
  Bond Configuration
    xmit-hash-policy layer2
    down-delay 200
    primary eth1-Sync
    lacp-rate Not configured
    mode active-backup
    up-delay 200
    mii-interval 100
    type sync
    min-links 0
  Bond Interfaces
    eth1-01
    eth1-Sync

2_01:
  Bond Configuration
    xmit-hash-policy layer2
    down-delay 200
    primary eth1-Sync
    lacp-rate Not configured
    mode active-backup
    up-delay 200
    mii-interval 100
    type sync
    min-links 0
  Bond Interfaces
    eth1-01
    eth1-Sync

As you can see, the primary is still set to the copper interface (eth1-Sync).

5) now I edited the gateway object in SmartConsole, fetched the topology and then installed policy - all was fine
6) after that I disconnected eth1-Sync (copper) ... and I saw that eth1-01 started to "flash" ... so it was switched from "backup" to "active" state.
7) a couple of seconds later:

As you can see ... it is not working.
It starts to work only after I reconnect eth1-Sync (copper).
You can see on this screenshot that eth1-01 has some Throughput and Packet Rate ... which indicates that it's fine.

So ... I see two options here:
1) I'm missing something ... but what ? It is bonding, so it should just work
2) Heart Beat is based on on-board Sync interface ... and it should be changed by R&D

I'm happy to test it more, that's why I have this lab environment.
For example different bonding modes, etc. - but I had no luck with any of the available modes.
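
For anyone repeating this test, the `show bonding group 1024` output from step 4 can be checked with a small script instead of by eye. This is only a minimal Python sketch, assuming the output has the same layout as the text pasted in this thread; `parse_bonding_output` and `primary_mismatches` are hypothetical helper names and not part of any Check Point tool.

```python
import re

# One member's section, copied from the output in this thread.
MEMBER_BLOCK = """\
{member}:
Bond Configuration
xmit-hash-policy layer2
down-delay 200
primary eth1-Sync
lacp-rate Not configured
mode active-backup
up-delay 200
mii-interval 100
type sync
min-links 0
Bond Interfaces
eth1-01
eth1-Sync
"""
SAMPLE = "\n".join(MEMBER_BLOCK.format(member=m) for m in ("1_01", "2_01"))


def parse_bonding_output(text):
    """Parse the pasted output into {member: {"config": {...}, "interfaces": [...]}}."""
    members = {}
    current = None
    in_interfaces = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = re.fullmatch(r"(\d+_\d+):", line)
        if header:
            # A line such as "1_01:" starts a new member section.
            current = {"config": {}, "interfaces": []}
            members[header.group(1)] = current
            in_interfaces = False
        elif current is None:
            continue  # ignore anything before the first member header
        elif line == "Bond Configuration":
            in_interfaces = False
        elif line == "Bond Interfaces":
            in_interfaces = True
        elif in_interfaces:
            current["interfaces"].append(line)
        else:
            # "key value" pairs; the value may contain spaces ("Not configured").
            key, _, value = line.partition(" ")
            current["config"][key] = value.strip()
    return members


def primary_mismatches(members, wanted):
    """Return the member IDs whose bond primary is not the wanted interface."""
    return [m for m, data in members.items() if data["config"].get("primary") != wanted]


members = parse_bonding_output(SAMPLE)
print(primary_mismatches(members, "eth1-01"))
```

With the output above, both members are reported, which matches the observation that the primary is still eth1-Sync on each of them.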

ShaiF · ‎2024-08-08

Hi @marcyn ,

Checking it internally and will update next week.

Regards,

Shai.

marcyn · ‎2024-08-22

Hi @ShaiF ,

Do you have any internal feedback ?

Best
m.

ShaiF · ‎2024-08-25

Hi @marcyn,

Indeed, there's an issue with the additional Sync slave.

We're on it for GA.

Regards,

Shai.

emmap · ‎2024-08-07

The distance for cables is only related to the next device they are plugged into. The devices can plug into a local switch and then the switches can be connected between DCs however you want them to be. So you can have the copper sync ports connected up to a local switch, then those switches can be connected together between DCs via fibre. All 6 of the devices in your example need to be connected to the same layer 2 segment. I don't understand what you are suggesting with direct fibre connections. Directly connected to what?
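
To illustrate the per-hop point, here is a minimal Python sketch that checks each cable against the limit for its own medium, rather than checking the end-to-end distance. The 100 m copper limit is the figure quoted earlier in this thread; the fibre limit, the device names and the cable lengths are hypothetical placeholders, and `Link` and `links_over_limit` are names made up for this example.

```python
from dataclasses import dataclass

# Per-medium cable limits in metres. The 100 m copper figure is the one quoted
# in this thread; the fibre figure is a HYPOTHETICAL placeholder, since the
# real limit depends on the optics and fibre type in use.
MAX_LENGTH_M = {"copper": 100.0, "fibre": 10_000.0}


@dataclass(frozen=True)
class Link:
    a: str           # device at one end of the cable
    b: str           # device at the other end
    medium: str      # "copper" or "fibre"
    length_m: float  # length of this single cable run


def links_over_limit(links, limits=MAX_LENGTH_M):
    """Return the links whose own length exceeds the limit for their medium."""
    return [link for link in links if link.length_m > limits[link.medium]]


# Switched design: copper only up to the local switch, fibre between the DCs.
switched = [
    Link("DC1-appliance", "DC1-switch", "copper", 5.0),
    Link("DC1-switch", "DC2-switch", "fibre", 3_000.0),
    Link("DC2-switch", "DC2-appliance", "copper", 5.0),
]

# Direct copper run between the two DCs, for comparison.
direct_copper = [Link("DC1-appliance", "DC2-appliance", "copper", 3_000.0)]

print(links_over_limit(switched))
print(links_over_limit(direct_copper))
```

In the switched design no single cable exceeds its limit even though the appliances are kilometres apart, while the direct copper run is flagged.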

marcyn · ‎2024-08-08

Yes, you are right about the distance, of course.
We can always do such a thing:
DC#1, DC#2 = "copper" sync between appliances using a switch
Connection between DC#1 and DC#2 via fiber.

Regarding direct fibre connection ... I was thinking here about a scenario where we have only 2 appliances - one in DC#1 and another in DC#2 (for example another server room in the same building) - and here we could use direct fibre connection between them, without using switch.

In 99% of cases it will probably be one of those:
1) copper between appliances in the same DC using switch, fibre between DCs
2) fiber between appliances in the same DC using switch, fibre between DCs

Because now R82 is in EA ... it is time for "academic discussion", and some "crazy ideas" ... because of that I'm trying to test some different scenarios 🙂

emmap · ‎2024-08-08

Yep, understand. And I agree that now is a great time for experimenting!

ShaiF · ‎2024-08-07

Hi Marcyn,
Indeed, such a scenario was not tested. We will push to fix it in GA.
Regards,

Shai.

the_rock · ‎2024-08-08

Looking forward to R82 GA 🙂

Andy

marcyn · ‎2024-11-04

Hi,

A couple of months later, now that we have R82 GA ... I can confirm that this issue is indeed fixed!
Sync works well, as I expected.

So to summarize:
If an appliance has only a copper Sync interface, we can now easily add an additional interface (fiber in this case) to bonding group 1024 (Sync), and in GA everything works well with this interface.
We can even completely remove the copper eth1-Sync interface from the Sync bonding group, leaving only this fiber interface (or a couple of such interfaces).

Thank you for this fix, now it makes sense 🙂

--
Best
m.
