Cluster dead timeout SK93454 - 3 or 30?

Kaspars_Zibarts · ‎2021-11-23

Hi! I wondered if you could check your gateways and see the value of this kernel parameter from sk93454:

fw ctl get int fwha_dead_timeout_multiplier
fwha_dead_timeout_multiplier = 3

The reason I'm asking is that the SK article says it should be 30, whereas we see 3, and we have seen very strange cluster failovers. For example, rebooting the standby cluster member resulted in a full failover because the active cluster member was reporting lost CCP packets. I'm starting to suspect that this kernel parameter is set too low (by mistake or typo), so instead of a 3-second cluster dead timeout we actually have 300 ms!

We are running R80.40 T120

Wolfgang · ‎2021-11-23

@Kaspars_Zibarts I checked on different systems; all show "3":

R80.10, R80.40, R81, R81.10 and VSX R80.10, R80.40

the_rock · ‎2021-11-23

I'm so glad you brought this up. As soon as I read it, I recalled working with a customer on an R80.20 cluster where the TAC escalation engineer told us to change this value to 30. When we pressed him on why, since we saw the value 3 on different versions, he really could not explain it. He said he would open an R&D task, and absolutely nothing came out of it. I like to think of myself as a pretty open-minded person who is willing to try things when stuff is broken, but definitely not someone who wants to blindly change things without any logical reasoning. Maybe someone from CP can chime in and give us a reason.

Best,
Andy
"Have a great day and if its not, change it"

Kaspars_Zibarts · ‎2021-11-23

@_Val_ - do you think you could ask internally pls? 🙂

_Val_ · ‎2021-11-23

@Kaspars_Zibarts What is the actual question you want me to ask?

_Val_ · ‎2021-11-23

Answering the original question,

The mentioned SK describes the recommended change, not the default setting for the parameter. The way I read it, it should say two things: the default value (which is 3 HTUs) and the recommended one (which is 30).

By default, CCP sends 3 hellos per second, and losing one causes the cluster to check connectivity and go into failover. 3 seconds equals 9 to 10 lost CCP frames, and may affect production traffic by delaying it on the failed, previously active cluster member.

That said, I am checking with SK owners what they tried to say 🙂

Kaspars_Zibarts · ‎2021-11-23

Thanks Val!

Are we looking at the same SK? I see it very clearly stated as 3 seconds by default.

If it's set to 3 (= 300 ms) and the CCP hello interval is 333 ms (1/3 s), then there's a high probability that a Hello will be missed. To allow one CCP Hello to be missed, the timer should be just under 2 x 1/3 s, or about 666 ms.

That's if I understood the logic correctly, and unless there are some other CCP timers. This is where it gets tricky, as there are a bunch of very old articles, and many of the adjustable kernel timers they mention do not exist in R80.40. So it would be nice to have an updated SK regarding CCP timer functionality 🙂
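
To make the two readings discussed here concrete, the following is a minimal Python sketch of the arithmetic only. It assumes nothing beyond the figure quoted in this thread (3 CCP hellos per second); the function names are made up for illustration, and the sketch does not model how ClusterXL actually evaluates its dead timer.

```python
from fractions import Fraction

HELLOS_PER_SECOND = 3  # CCP hello rate quoted in this thread (assumption)


def hello_interval_ms(hellos_per_second: int = HELLOS_PER_SECOND) -> Fraction:
    """Exact interval between two hellos, in milliseconds."""
    return Fraction(1000, hellos_per_second)


def hellos_in_window(timeout_ms: int,
                     hellos_per_second: int = HELLOS_PER_SECOND) -> int:
    """Number of whole hello intervals that fit inside a timeout window."""
    return int(Fraction(timeout_ms) // hello_interval_ms(hellos_per_second))


# Compare the two candidate readings of the default: 300 ms versus 3 s.
for timeout_ms in (300, 3000):
    print(timeout_ms, "ms window holds", hellos_in_window(timeout_ms), "hello interval(s)")
```

Under these assumptions a 300 ms window is shorter than a single hello interval (about 333 ms), while a 3000 ms window spans 9 intervals, which is in line with the "9 to 10 CCP frames" figure mentioned earlier in the thread.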

_Val_ · ‎2021-11-24

Yes, we are. Hence I said it is badly worded at the beginning, and I am already taking it up with the owners. It should say, AFAIK, "Cluster dead interval is 0.3 second, by default"

Now, see the rest of my explanation; it all clicks into place 🙂

Kaspars_Zibarts · ‎2021-11-24

Great! Thanks Val!

But then it raises the same question: if the timeout is 300 ms and the interval between CCP hellos is 333 ms, then the timeout is too short, as it can start counting right after one Hello is sent and will expire before the next Hello arrives.

_Val_ · ‎2021-11-24

No, it is .3 seconds of additional wait for the missing packet.

genisis__ · ‎2021-11-23

# fw ctl get int fwha_dead_timeout_multiplier
fwha_dead_timeout_multiplier = 3

Running JHFA 125 on the device I ran the command on.

Pablo_Munoz · ‎2023-04-24

I know this is an old thread, but this may be helpful for future readers.

The SK says the recommended value is 30 "HTU", while the value we configure for fwha_dead_timeout_multiplier is just a multiplier (not HTUs).

This parameter takes the value we configure (3 or any other value) and multiplies it by 10 HTUs (each HTU is 100 ms). So the timeout in this case becomes 3 x 10 HTUs = 30 HTUs = 3 seconds. This is the default AND recommended value. You can also find more information about this parameter in sk92723.

Both lines written in the SK are correct:

- Cluster dead interval is 3 seconds, by default.

- Recommended value for both kernel parameters is 30 (HTU).

That is just to explain the inner logic behind what is written in the SK. The bottom line is that whatever value we configure for this parameter ends up being the number of seconds for this timeout (because the value is always multiplied by 10 HTUs in the background).
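
As a worked example of the conversion described in this post, here is a small Python sketch. It assumes the interpretation given above (the multiplier is applied to 10 HTUs of 100 ms each) and the `name = value` output format of `fw ctl get int` shown earlier in the thread; the helper names `parse_kernel_int` and `dead_timeout_ms` are hypothetical and not part of any Check Point tool.

```python
HTU_MS = 100              # one HTU is 100 ms, per the explanation above
HTUS_PER_MULTIPLIER = 10  # multiplier is applied to 10 HTUs, per the explanation above


def parse_kernel_int(line: str) -> tuple:
    """Split a 'name = value' line (as printed by 'fw ctl get int') into (name, int)."""
    name, _, value = line.partition("=")
    return name.strip(), int(value.strip())


def dead_timeout_ms(multiplier: int) -> int:
    """Effective dead timeout in milliseconds under the assumed interpretation."""
    return multiplier * HTUS_PER_MULTIPLIER * HTU_MS


# Sample line copied from the command output posted in this thread.
name, multiplier = parse_kernel_int("fwha_dead_timeout_multiplier = 3")
print(name, "->", dead_timeout_ms(multiplier) / 1000, "seconds")
```

With the default multiplier of 3 this yields 3000 ms, that is, 3 seconds. Under the same reading, setting the multiplier itself to 30 would give 30 seconds rather than 3, which is why the "30 (HTU)" in the SK should not be entered as the parameter value.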

I hope this helps and makes sense.
