Egor_Cherkasov
Contributor

standby cluster member fails randomly

Hello CheckMates,

Here is the issue: several times now, the standby member has stopped answering ICMP, HTTP, HTTPS and SSH requests. Only a reboot of the member helps.

In /var/log/messages there are only 2 lines which correlate with the time of that failover:

Jun 10 17:10:46 2019 cpfw-msk-2 kernel: [fw4_1];CLUS-220201-2: Starting CUL mode because CPU usage (81%) on the remote member 1 increased above the configured threshold (80%).
Jun 10 17:10:56 2019 cpfw-msk-2 kernel: [fw4_1];CLUS-120202-2: Stopping CUL mode after 10 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.

 

I have read some articles on CheckMates related to these messages; however, I wonder: do these messages mean a failover? And what is the possible cause?

Meanwhile, on both cluster members the Monitoring blade does not show any high peaks: the first screenshot is the active member, the second is the standby.

 

6 Replies
_Val_
Admin

The message does not indicate a failover. In fact, the opposite: the CUL feature freezes the CLX status during high CPU utilisation to avoid a failover.

 

Something is going on with the CPU; beyond that, you need to look further.

Gaurav_Pandya
Advisor

Hi,

Are you getting these messages when you install policy?

Please filter by type Control in SmartLog and check the description for any hints from that time.

PhoneBoy
Admin

CUL == Cluster Under Load
Looking at cpview history around the error message times might give some insights.
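To know which times to look at in the history, the CUL messages can be pulled out of /var/log/messages first. Below is a minimal Python sketch, assuming the log lines keep exactly the format quoted in the original post; the function name `parse_cul_events` and the regular expressions are illustrative and not part of any Check Point tool. It also assumes an English/C locale so that month names like "Jun" parse.

```python
import re
from datetime import datetime

# The two CUL messages quoted in the original post, used as sample input.
SAMPLE = [
    "Jun 10 17:10:46 2019 cpfw-msk-2 kernel: [fw4_1];CLUS-220201-2: Starting CUL mode because CPU usage (81%) on the remote member 1 increased above the configured threshold (80%).",
    "Jun 10 17:10:56 2019 cpfw-msk-2 kernel: [fw4_1];CLUS-120202-2: Stopping CUL mode after 10 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec.",
]

# Assumed line layout: "<Mon> <day> <HH:MM:SS> <year> <host> kernel: [<instance>];CLUS-...: Starting|Stopping CUL mode"
CUL_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2} \d{4}) (?P<host>\S+) kernel: "
    r"\[(?P<inst>[^\]]+)\];(?P<code>CLUS-\d+-\d+): (?P<action>Starting|Stopping) CUL mode"
)
# Only the "Starting" message reports a measured CPU percentage.
PCT_RE = re.compile(r"CPU usage \((\d+)%\)")


def parse_cul_events(lines):
    """Return one dict per CUL start/stop message found in the given lines."""
    events = []
    for line in lines:
        m = CUL_RE.match(line)
        if not m:
            continue  # not a CUL message, skip it
        pct = PCT_RE.search(line)
        events.append({
            "time": datetime.strptime(m.group("ts"), "%b %d %H:%M:%S %Y"),
            "host": m.group("host"),
            "action": m.group("action"),
            "remote_cpu_pct": int(pct.group(1)) if pct else None,
        })
    return events
```

Calling `parse_cul_events(open("/var/log/messages"))` on a copy of the log gives a list of timestamps that can then be compared against the cpview history for the same moments.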
Timothy_Hall
Champion

Agree with the others, you need to identify why CPU load is so high on the standby; the CUL is just a symptom of your problem and not the cause.  cpview in history mode (-t) and the sar command can be helpful.  If you can identify in which "space" the excessive CPU is being consumed (us/sy/ni/si/hi) that will help guide where to look next.
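For illustration, the us/sy/ni/si/hi split can also be computed by hand from two readings of /proc/stat on any Linux system. This is only a rough sketch: the function names are hypothetical, the field order follows the standard Linux layout of the aggregate `cpu` line, and it assumes a Python interpreter is available wherever the snapshots are taken or analysed. cpview and sar remain the proper tools on the gateway itself.

```python
# Field order of the aggregate "cpu" line in /proc/stat on Linux:
# user, nice, system, idle, iowait, irq, softirq, steal
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu' line into a dict of cumulative tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def read_snapshot(path="/proc/stat"):
    """Read the current counters; the aggregate line is the first line of the file."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


def cpu_breakdown(before, after):
    """Percentage of time spent in each CPU 'space' between two snapshots."""
    deltas = {k: after[k] - before[k] for k in before}
    total = sum(deltas.values())
    if total <= 0:
        return {k: 0.0 for k in deltas}
    return {k: 100.0 * d / total for k, d in deltas.items()}
```

Taking `read_snapshot()` twice a few seconds apart and passing both results to `cpu_breakdown` shows where the time went: a high `si` or `hi` share points at interrupt and packet handling, while a high `us` share points at a user-space process.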

Any dynamic routing being used on this gateway cluster?  There are a few known causes of high CPU on the standby when that feature is in use, see sk95966 and sk105863 for more details.

 

Egor_Cherkasov
Contributor

Thank you colleagues!

Timothy, I guess those limitations do not apply to my case, because the version is R80.20.

Nevertheless, the weird thing is that the Monitoring blade shows me no peaks at those moments.

Can the cpview history give me more information, and how far back can I drill down in this history (a day, a week, or more)?

 

 

Thanks in advance.

G_W_Albrecht
Legend

Look into this document: sk101878: CPView Utility

I also found sk35466 and sk120712: Standby Cluster Member stops responding.

CCSE CCTE CCSM SMB Specialist