Re: standby cluster member fails randomly

Egor_Cherkasov · ‎2019-06-11

Hello CheckMates,

Here is the issue: several times now, the standby member has stopped answering ICMP, HTTP, HTTPS and SSH requests. Only a reboot of the member helps.

In /var/log/messages there are only 2 lines which correlate with the time of that failover:

Jun 10 17:10:46 2019 cpfw-msk-2 kernel: [fw4_1];CLUS-220201-2: Starting CUL mode because CPU usage (81%) on the remote member 1 increased above the configured threshold (80%).
Jun 10 17:10:56 2019 cpfw-msk-2 kernel: [fw4_1];CLUS-120202-2: Stopping CUL mode after 10 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.

I have read some articles on CheckMates related to these messages; however, I wonder: do these messages mean a failover? And what is the possible cause?

Meanwhile, the Monitoring blade shows no high peaks on either cluster member: the first screenshot is the active member, the second is the standby.

_Val_ · ‎2019-06-11

The message does not indicate a failover. In fact, the opposite: the CUL feature freezes the ClusterXL state under high CPU utilisation, to avoid a failover.

Something is going on with the CPU; beyond that, you need to look further.

Gaurav_Pandya · ‎2019-06-11

Hi,

Are you getting these messages when you install policy?

Please filter on type Control in SmartLog and check the description for any hints around that time.

PhoneBoy · ‎2019-06-13

CUL == Cluster Under Load
Looking at cpview history around the error message times might give some insights.
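
To know which times to look at in the cpview history, it can help to pull the CUL start/stop timestamps out of /var/log/messages first. A minimal sketch in Python, assuming the message format matches the two lines quoted at the top of this thread (other versions may log it differently); the function name `parse_cul_events` is just an illustrative helper, not part of any Check Point tool:

```python
import re
from datetime import datetime

# Pattern derived only from the two log lines quoted in this thread;
# treat it as an assumption for any other Gaia version.
CUL_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"(?P<host>\S+) kernel: .*?"
    r"(?P<action>Starting|Stopping) CUL mode"
)


def parse_cul_events(lines):
    """Return a list of (timestamp, host, action) for each CUL start/stop line."""
    events = []
    for line in lines:
        match = CUL_RE.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%b %d %H:%M:%S %Y")
            events.append((ts, match.group("host"), match.group("action")))
    return events
```

Feeding it the lines of a copied messages file (for example `parse_cul_events(open("messages"))`) gives the exact windows to jump to in cpview history mode.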

Timothy_Hall · ‎2019-06-13

Agree with the others, you need to identify why CPU load is so high on the standby; the CUL is just a symptom of your problem and not the cause. cpview in history mode (-t) and the sar command can be helpful. If you can identify in which "space" the excessive CPU is being consumed (us/sy/ni/si/hi) that will help guide where to look next.
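
If you want a quick look at the "space" breakdown without sar, the same counters are exposed in /proc/stat on Linux. A rough sketch in Python, assuming a Python interpreter is available where you run it and that the aggregate `cpu` line has the usual field order (user, nice, system, idle, iowait, irq, softirq, steal); the helper names are made up for illustration:

```python
# Labels follow top's naming; order follows the aggregate "cpu" line in /proc/stat.
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def parse_cpu_line(line):
    """Map the aggregate 'cpu' line of /proc/stat to top-style labels."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_shares(before, after):
    """Percentage of CPU time spent in each space between two samples."""
    deltas = dict((k, after[k] - before[k]) for k in before if k in after)
    total = sum(deltas.values())
    if total <= 0:
        return dict((k, 0.0) for k in deltas)
    return dict((k, 100.0 * d / total) for k, d in deltas.items())


def sample_proc_stat(path="/proc/stat"):
    """Read and parse the first line of /proc/stat (Linux only)."""
    with open(path) as handle:
        return parse_cpu_line(handle.readline())
```

Take two samples with `sample_proc_stat()` a few seconds apart and pass them to `cpu_shares()`; a large `si`/`hi` share points toward traffic handling, while `us` or `sy` points toward processes or kernel work.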

Any dynamic routing being used on this gateway cluster? There are a few known causes of high CPU on the standby when that feature is in use, see sk95966 and sk105863 for more details.


Egor_Cherkasov · ‎2019-06-13

Thank you colleagues!

Timothy, I guess those limitations do not apply to my case, because the version is R80.20.

Nevertheless, the weird thing is that the Monitoring blade shows me no peaks at those moments.

Can the cpview history give me more information, and how far back can I drill down in this history (a day, a week, or more)?

Thanks in advance.

G_W_Albrecht · ‎2019-06-14

Look into this document: sk101878: CPView Utility

Also, I found sk35466 and sk120712: Standby Cluster Member stops responding.

