Commands not executing in Management Server R80.10

entsupport · ‎2020-01-23

Hello All,

For the last 2 days, every morning we have been facing a very strange issue: commands are not getting executed on the management server. CPU and memory utilization are normal.

After rebooting the management server the issue is fixed, but it comes back the next morning.

We have collected a few outputs during the issue, as suggested by TAC, and are attaching them herewith.

We have logged a ticket with Check Point TAC, but they have not been able to fix the issue either.

Kindly advise if there is any troubleshooting we can perform to fix this issue.

G_W_Albrecht · ‎2020-01-24

Which commands do not get executed? What is shown in the logs from the time of the issue?

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

entsupport · ‎2020-01-24

Commands such as cpview, cpstat, cpinfo, reboot, etc. are not getting executed.

[Expert@DSPMGMT:0]# tail -f /var/log/messages
Jan 24 08:58:29 2020 DSPMGMT PAM-tacplus[1819]: auth failed: 2
Jan 24 09:21:24 2020 DSPMGMT snmpd: Error: Timeout waiting for response from database server.
Jan 24 09:22:04 2020 DSPMGMT monitord[3873]: Error: Timeout waiting for response from database server.
Jan 24 09:22:24 2020 DSPMGMT snmpd: Error: Timeout waiting for response from database server.
Jan 24 09:38:01 2020 DSPMGMT PAM-tacplus[4844]: auth failed: 2
Jan 24 09:58:43 2020 DSPMGMT PAM-tacplus[6059]: auth failed: 2
Jan 24 10:49:33 2020 DSPMGMT PAM-tacplus[8861]: auth failed: 2
Jan 24 10:54:43 2020 DSPMGMT PAM-tacplus[9166]: auth failed: 2
Jan 24 10:54:49 2020 DSPMGMT PAM-tacplus[9166]: auth failed: 2
Jan 24 10:56:38 2020 DSPMGMT PAM-tacplus[9325]: auth failed: 2
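
For anyone comparing notes on a similar excerpt, here is a minimal Python sketch that counts the "Timeout waiting for response from database server" lines per process and reports when the first one appeared. It assumes the line layout shown above (month, day, time, year, host, process, message). The function names `parse_line` and `summarize_timeouts` are hypothetical helpers for illustration, not part of any Check Point tooling.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout, taken from the excerpt above:
# "<Mon> <day> <HH:MM:SS> <year> <host> <process>[pid]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"(?P<host>\S+) (?P<proc>[^\[:\s]+)(?:\[\d+\])?: (?P<msg>.*)$"
)
TIMEOUT_TEXT = "Timeout waiting for response from database server"


def parse_line(line):
    """Return (timestamp, process, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # Collapse repeated spaces (e.g. single-digit days) before parsing the date.
    ts_text = " ".join(match.group("ts").split())
    ts = datetime.strptime(ts_text, "%b %d %H:%M:%S %Y")
    return ts, match.group("proc"), match.group("msg")


def summarize_timeouts(lines):
    """Return (first timeout timestamp or None, {process: timeout count})."""
    counts = Counter()
    first_seen = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines that do not follow the assumed layout
        ts, proc, msg = parsed
        if TIMEOUT_TEXT in msg:
            counts[proc] += 1
            if first_seen is None or ts < first_seen:
                first_seen = ts
    return first_seen, dict(counts)
```

To use it, copy the log off the box and pass the open file object (or a list of lines) to `summarize_timeouts`. The first timestamp gives a starting point for checking what else happened on the server around that time.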

PhoneBoy · ‎2020-01-24

Someone from R&D will probably have to have a look at this.
If you've opened a TAC case and provided the necessary details, it will make its way to them.

Blake_Fithen · ‎2021-04-27

Good afternoon. Was there a resolution to this? We are having identical problems with a Smart-1 5050, R80.30. The only difference is the power cords must be reseated. A warm reboot or shutdown -r does not help. Thank you for any info you can provide. I do have a case open with TAC.

the_rock · ‎2021-04-27

Can't say I have ever seen that before... what did TAC say?

Best,
Andy
"Have a great day and if its not, change it"

Blake_Fithen · ‎2021-04-27

TAC is still working on it. Trying to duplicate the problem with our configuration.

the_rock · ‎2021-04-27

Just curious, as I like to approach every problem logically. When you say this started 2 days ago, can you think of anything that may have changed on the mgmt server 2 or 3 days ago? Can you check the audit logs to see if there is anything of interest from when this issue occurred? One thing that comes to mind is GuiDBedit, but unless someone inadvertently made changes there, it might not be relevant. Just to be on the safe side, I would try an "install database" on the server itself.

TAC has a valid idea... if they can import your config into their lab and try to fix it, they can provide the solution.

Best,
Andy

Blake_Fithen · ‎2021-04-27

Thanks for your interest. I don't recall saying it happened two days ago, though. It started about 12 days ago and is very intermittent. We're about 14 hours total into troubleshooting: reinstalling from the R80.30 ISO (twice), patching to the latest hotfix, migrate export/import, etc. We push policy and all is good. Then we wait some number of minutes/hours/days, and the same problem returns.

My gut says it's hardware sensor related - or maybe ILMI related because only reseating the power cables will bring it back to the point where the GAIA portal and the dashboard are useable again. But that's just my opinion. As soon as that database timeout message appears in /var/log/messages, that's it for the portal and dashboard.
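
Since the problem is intermittent, it may help to show TAC how the timeout messages cluster over time. A rough Python sketch, assuming the timestamps of the database timeout lines have already been extracted from /var/log/messages: it groups them into separate incidents whenever the gap between two messages exceeds a threshold. The helper name `group_incidents` and the 30-minute default are assumptions chosen for illustration only.

```python
from datetime import datetime, timedelta


def group_incidents(timestamps, max_gap=timedelta(minutes=30)):
    """Group timestamps into incidents as (start, end, message_count) tuples.

    Two consecutive messages belong to the same incident when they are
    no more than max_gap apart (an assumed, adjustable threshold).
    """
    incidents = []
    for ts in sorted(timestamps):
        if incidents and ts - incidents[-1][1] <= max_gap:
            # Close enough to the previous message: extend the current incident.
            start, _, count = incidents[-1]
            incidents[-1] = (start, ts, count + 1)
        else:
            # Gap too large (or first message): start a new incident.
            incidents.append((ts, ts, 1))
    return incidents
```

The list of incident start times can then be lined up against reboots, power-cord reseats, policy pushes, or hardware sensor events to see whether any pattern stands out.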

the_rock · ‎2021-04-27

Sorry, my apologies. I read the original post, which said "since last 2 days"... that's what I wanted to respond to, but I replied to you instead. Now that you have said all that, I agree 100% with your assessment... did you ask TAC for an RMA? I can't see what else they could ask you to do, except send a replacement.

Best,
Andy

Blake_Fithen · ‎2021-04-27

I forgot to add that I've had practically zero problems like this before. For roughly 14 months it's been rock solid with regular operational rule changes, IPS, other blade updates, VPN stuff, regular hotfix updates, etc. No real negative, work-stopping events like this for a long time.

the_rock · ‎2021-04-27

Well, such an expensive machine as a Smart-1 5050 had better work way longer than 14 months 🙂

Best,
Andy

Blake_Fithen · ‎2021-04-27

No worries. Agreed. Decision on RMA late tomorrow.
