Unexpected crash of ClusterXL active member

Matlu · 2023-09-15

Hello, Folks.

Today our client's Main Cluster had a problem.

At about 10:00 am (GMT-5), the client lost services across the board (Internet, publishing, communication between internal segments).
In practice, the active member of the cluster "died" without any apparent cause.

The client tried to force a failover with the command "clusterXL_admin down", but the command did not work, and they had to restart the appliance.

They have already had similar bad experiences with this cluster, and it seems to be a problem with the hardware that was sold (Appliance 6000).

Is it possible to determine the root cause that made the equipment simply stop working and caused this disaster for the customer?

Best regards.

Check Point: R81.10 with JHF Take 95

PhoneBoy · 2023-09-15

If the system crashed, there is going to be a vmcore somewhere.
I would highly recommend engaging with the TAC on this as, if it's a hardware failure, an RMA may be required.
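As a rough way to check whether a dump was actually written, the sketch below walks a few directories and lists any file whose name contains "vmcore", newest first. The directory names in `CANDIDATE_DIRS` and the helper `find_vmcores` are assumptions made for illustration, not something confirmed in this thread; adjust them to wherever your system writes kernel dumps, and it assumes a Python 3 interpreter is available.

```python
import os
from datetime import datetime

# Assumed locations only; change these to match your system.
CANDIDATE_DIRS = ("/var/log/crash", "/var/crash")


def find_vmcores(roots=CANDIDATE_DIRS):
    """Return (path, mtime, size) for files named like a vmcore, newest first."""
    hits = []
    for root in roots:
        if not os.path.isdir(root):
            continue  # skip directories that do not exist on this host
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                if "vmcore" in name:
                    path = os.path.join(dirpath, name)
                    st = os.stat(path)
                    hits.append((path, st.st_mtime, st.st_size))
    hits.sort(key=lambda h: h[1], reverse=True)
    return hits


if __name__ == "__main__":
    for path, mtime, size in find_vmcores():
        stamp = datetime.fromtimestamp(mtime).strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp}  {size:>12}  {path}")
```

A dump whose timestamp matches the outage window is the one worth handing to TAC; an empty result does not prove there was no crash, only that nothing was found in the directories searched.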

the_rock · 2023-09-16

TAC may suggest upgrading to R81.20, but it is hard to say whether that would solve anything. As @PhoneBoy said, it sounds like a vmcore was generated, so that would need to be investigated further.

Andy

Best,
Andy
"Have a great day and if its not, change it"

JozkoMrkvicka · 2023-09-16

As soon as possible once the device is "alive" again, collect all logs present on the device (dmesg, /var/log/messages) and a cpinfo. Collecting them before they are overwritten with newer entries gives TAC a chance to spot what went wrong.
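A minimal sketch of that collection step, assuming a Python 3 interpreter is available on the box: it saves the current dmesg output and packs /var/log/messages together with its rotated copies into one timestamped archive. The function name `collect_logs`, the archive naming, and the destination directory are hypothetical choices for illustration. cpinfo is not run here and should be generated separately.

```python
import glob
import os
import subprocess
import tarfile
import time

# Path taken from the post; rotated copies are matched by prefix.
LOG_PATHS = ("/var/log/messages",)


def collect_logs(dest_dir, log_paths=LOG_PATHS, run_dmesg=True):
    """Pack the logs (and optionally dmesg output) into a tar.gz; return its path."""
    os.makedirs(dest_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")

    sources = []
    for base in log_paths:
        # "messages*" also picks up messages.1, messages.2, ...
        sources.extend(sorted(glob.glob(base + "*")))

    if run_dmesg:
        dmesg_out = os.path.join(dest_dir, f"dmesg-{stamp}.txt")
        try:
            with open(dmesg_out, "w") as fh:
                subprocess.run(["dmesg"], stdout=fh,
                               stderr=subprocess.STDOUT, check=False)
            sources.append(dmesg_out)
        except OSError:
            pass  # dmesg missing or not permitted; keep going with the files

    archive = os.path.join(dest_dir, f"crash-logs-{stamp}.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        for src in sources:
            tar.add(src, arcname=os.path.basename(src))
    return archive


if __name__ == "__main__":
    # Hypothetical destination; pick a partition with enough free space.
    print(collect_logs("/var/tmp/crash-collect"))
```

Copying the archive off the appliance right away protects it from log rotation and from a second crash.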

Kind regards,
Jozko Mrkvicka

the_rock · 2023-09-17

Very good point @JozkoMrkvicka

Best,
Andy
"Have a great day and if its not, change it"
