standby cluster member fails randomly
Hello CheckMates,
Here is the issue: several times now, the standby member has stopped answering ICMP, HTTP, HTTPS and SSH requests. Only rebooting the member helps.
In /var/log/messages there are only two lines that correlate with the time of that failover:
Jun 10 17:10:46 2019 cpfw-msk-2 kernel: [fw4_1];CLUS-220201-2: Starting CUL mode because CPU usage (81%) on the remote member 1 increased above the configured threshold (80%).
Jun 10 17:10:56 2019 cpfw-msk-2 kernel: [fw4_1];CLUS-120202-2: Stopping CUL mode after 10 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.
I have read some articles on CheckMates related to these messages. However, I wonder: do these messages mean a failover? And what is the possible cause?
Meanwhile, the Monitoring blade shows no high peaks on either cluster member. The first screenshot is the active member, the second is the standby.
The message does not indicate a failover. In fact, it is the opposite: the CUL (Cluster Under Load) feature freezes the CLX status when CPU utilisation is high, to avoid a failover.
Something is going on with the CPU; beyond that, you need to look further.
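To see how often CUL mode kicks in and how long it lasts, the two kernel lines quoted in the question can be parsed out of /var/log/messages. The Python below is only a rough sketch: the regular expressions are modelled on those two sample lines, and the function names (`parse_cul_events`, `cul_intervals`) are made up for illustration. Other versions may word the messages differently.

```python
import re
from datetime import datetime

# Patterns modelled on the two CUL lines quoted in the question (assumption:
# other releases may format these kernel messages differently).
CUL_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\s+\d{4})\s+(\S+)\s+kernel:\s+"
    r"\[[^\]]+\];(CLUS-\d+-\d+):\s+(Starting|Stopping) CUL mode"
)
CPU_RE = re.compile(r"CPU usage \((\d+)%\)")
THRESHOLD_RE = re.compile(r"threshold \((\d+)%\)")


def parse_cul_events(lines):
    """Return one dict per CUL start/stop message found in the given lines."""
    events = []
    for line in lines:
        m = CUL_RE.match(line)
        if not m:
            continue
        cpu = CPU_RE.search(line)
        threshold = THRESHOLD_RE.search(line)
        events.append({
            "time": datetime.strptime(m.group(1), "%b %d %H:%M:%S %Y"),
            "host": m.group(2),
            "code": m.group(3),
            "action": m.group(4),
            # Only the "Starting" message reports the measured CPU usage.
            "cpu": int(cpu.group(1)) if cpu else None,
            "threshold": int(threshold.group(1)) if threshold else None,
        })
    return events


def cul_intervals(events):
    """Pair each 'Starting' event with the next 'Stopping' event."""
    intervals = []
    start = None
    for ev in events:
        if ev["action"] == "Starting":
            start = ev
        elif start is not None:
            seconds = (ev["time"] - start["time"]).total_seconds()
            intervals.append((start["time"], seconds, start["cpu"]))
            start = None
    return intervals
```

Feeding it the lines of /var/log/messages (for example `parse_cul_events(open("/var/log/messages"))`) gives a list of CUL periods with their start time, duration in seconds and the CPU percentage that triggered them, which can then be compared against the times the standby stopped responding.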
Hi,
Are you getting these messages when you install policy?
In SmartLog, please filter by type Control and check the description for any hints around that time.
Looking at cpview history around the error message times might give some insights.
Agree with the others, you need to identify why CPU load is so high on the standby; the CUL is just a symptom of your problem and not the cause. cpview in history mode (-t) and the sar command can be helpful. If you can identify in which "space" the excessive CPU is being consumed (us/sy/ni/si/hi) that will help guide where to look next.
Any dynamic routing being used on this gateway cluster? There are a few known causes of high CPU on the standby when that feature is in use, see sk95966 and sk105863 for more details.
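To illustrate what identifying the CPU "space" means: on Linux, the first line of /proc/stat holds cumulative tick counters in the order user, nice, system, idle, iowait, irq, softirq, steal, which correspond to the us/ni/sy/id/wa/hi/si/st columns shown by top. The sketch below is not a Check Point tool and the function names are invented for the example; it simply takes two samples of that line and reports where the time went in between.

```python
# Column order of the aggregate "cpu" line in /proc/stat, using top's labels.
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line of /proc/stat into a dict of ticks."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Very old kernels expose fewer columns; treat the missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_breakdown(before_line, after_line):
    """Percentage of CPU time spent in each space between two samples."""
    before = parse_cpu_line(before_line)
    after = parse_cpu_line(after_line)
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * ticks / total for name, ticks in deltas.items()}


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate CPU line from a live system."""
    with open(path) as f:
        return f.readline()
```

On a live member you would call `read_cpu_line()` twice a few seconds apart and pass both results to `cpu_breakdown()`. A large `si` or `hi` share points toward interrupt and network processing, while a large `us` share points toward a user-space process, which is the kind of hint that narrows down where to look next.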
Thank you, colleagues!
Timothy, I guess those limitations do not apply to my case, because the version is R80.20.
Nevertheless, the weird thing is that the Monitoring blade shows no peaks at those moments.
Can the cpview history give me more information, and how far back can I drill down in this history (a day, a week or more)?
Thanks in advance.
Look into this document: sk101878: CPView Utility.
Also, I found sk35466 and sk120712: Standby Cluster Member stops responding.
