Re: Management Server High-CPU post upgrade to R80...

Charles_Palmer · ‎2019-11-18

About a month ago, I upgraded my Smart-1 410 Management Server from R80.10 to R80.30 and immediately installed Take 50. This was an in-place upgrade, not a clean install. I had a few issues with high CPU and contacted support, and we ended up installing Take 76 on my management server to address a high-CPU issue with Java. That seems to have corrected the Java problem, but I still had high CPU from postgres processes. After a few hours those settled down, and the server operated normally for the rest of the week.

On Saturday, the processors spiked to near 100% and stayed that way until late Monday/early Tuesday, and then it cleared up again. It was the postgres processes that were consuming the CPU. While I observed it, the postgres processes consumed the processor for about 45 minutes out of every hour, with a break of about 15 minutes. This is enough for my Indeni monitoring to put the management server into cooldown and then start monitoring it again, only for the CPU to spike once more, so Indeni stops its normal interrogation and limits itself to CPU and Memory monitoring. I have tried to address this with support, and they don't have any further guidance for me thus far. This is the third weekend since my upgrade on which this has happened.

This screams of some scheduled process that drives the CPU high while it runs, but I don't know what it might be. I may have just reached the end of the cycle for this week, as it has been almost 20 minutes since the CPU stopped being high this time. But it has generally been 2-3 days of mostly high CPU on my management server, starting sometime on Saturday.
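One way to pin down the cycle described above is to log CPU samples at a fixed interval and extract the busy windows, so their start times can be compared against scheduled jobs. The following is only a minimal sketch: the function name, the (minute, CPU %) sample format, the 90% threshold, and the sample values are all hypothetical and not taken from the actual server.

```python
def busy_windows(samples, threshold=90.0):
    """Return (start, end) pairs for contiguous runs of high-CPU samples.

    samples: iterable of (minute_offset, cpu_percent) in time order.
    threshold: CPU percentage treated as "high" (an assumed value).
    """
    windows = []
    start = prev = None
    for minute, cpu in samples:
        if cpu >= threshold:
            if start is None:
                start = minute      # a new busy run begins here
            prev = minute           # last minute known to be busy
        elif start is not None:
            windows.append((start, prev))
            start = None            # the run ended with this quiet sample
    if start is not None:
        windows.append((start, prev))  # run still open at the end of the data
    return windows


# Made-up samples taken every 15 minutes, purely for illustration.
SAMPLES = [(0, 98.0), (15, 97.0), (30, 99.0), (45, 20.0),
           (60, 96.0), (75, 98.0), (90, 97.0), (105, 15.0)]

if __name__ == "__main__":
    for start, end in busy_windows(SAMPLES):
        print(f"busy from minute {start} to minute {end}")
```

If the windows start at the same minute of each hour, that would point toward something scheduled rather than load-driven.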

Thank you for any guidance or assistance in what I should check to figure out what is causing this high-CPU condition each week.

PhoneBoy · ‎2019-11-21

I know there are some periodic processes that can consume a lot of CPU but at low priority.
Which means: if something else needs the CPU, it backs off, but if nothing needs the CPU, it will use it.
That might be the case here, but without seeing the output of ps -auxwww, I can't be 100% sure.
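To check the low-priority theory, the nice value of each busy process can be compared directly. Below is a minimal sketch, assuming the server's ps accepts the procps format `ps -eo pid,ni,pcpu,comm`; the function names, the 20% threshold, and the sample text are hypothetical and made up for illustration, not output from the affected server.

```python
def parse_ps(ps_text):
    """Parse text from `ps -eo pid,ni,pcpu,comm` into a list of dicts."""
    rows = []
    for line in ps_text.strip().splitlines()[1:]:   # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, ni, pcpu, comm = parts
        try:
            nice = int(ni)
        except ValueError:
            nice = 0    # ps prints "-" for some scheduling classes
        rows.append({"pid": int(pid), "nice": nice,
                     "cpu": float(pcpu), "comm": comm.strip()})
    return rows


def busy_by_priority(rows, cpu_threshold=20.0):
    """Split busy processes into niced (low priority) and normal priority."""
    busy = [r for r in rows if r["cpu"] >= cpu_threshold]
    niced = [r for r in busy if r["nice"] > 0]
    normal = [r for r in busy if r["nice"] <= 0]
    return niced, normal


# Made-up sample text, purely for illustration.
SAMPLE = """\
  PID  NI %CPU COMMAND
 1201  19 85.0 java
 1350   0 92.5 postgres
 1351   0  1.2 fwm
"""

if __name__ == "__main__":
    niced, normal = busy_by_priority(parse_ps(SAMPLE))
    print("low priority:", [r["comm"] for r in niced])
    print("normal priority:", [r["comm"] for r in normal])
```

Busy processes with a positive nice value should yield to other work; busy processes at normal priority are the ones more likely to matter.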

Charles_Palmer · ‎2019-11-21

Thank you for your reply.

I will keep an eye on it, and if the problem happens this weekend for the 4th time, I will run "ps -auxwww", save the output to a file, and upload it for your review. You may be right that it is running at low priority, because I haven't noticed any performance issues that had me hunting. What brought it to my attention was Indeni reporting the sustained high CPU, which repeatedly put it into its CPU-and-Memory-only monitoring mode.

Is there anything else I should collect or run besides that? (It looks like that already dumps a pretty comprehensive chunk of data.)

Chris_Atkinson · ‎2019-11-21

By way of background, are SmartEvent and/or the Compliance Blade enabled on the Smart-1 410? How about any OPSEC connections?

CCSM R77/R80/ELITE

Charles_Palmer · ‎2019-11-21

Yes, SmartEvent Server and Correlation Unit are both enabled, as well as Compliance. Additionally, Network Policy Management, Logging & Status/Identity Logging, and User Directory are also checked. Endpoint Policy Management is not checked because we don't use it. Workflow is greyed out and unchecked, while Provisioning is greyed out but checked.

Charles_Palmer · ‎2019-11-25

I have the high-CPU situation this morning (though not as extreme as it was the previous three weekends) and I ran the ps -auxwww as requested. I have it saved to a file. Should I just post the contents into one of these messages? If not, how shall I get the results to you?

Charles_Palmer · ‎2019-11-25

I missed the little paperclip until I had already clicked the Post button.

Timothy_Hall · ‎2019-11-25

Looking at your ps output there is some low-priority SOLR log indexing going on, but the number of postgres: related processes and their CPU utilization looks far too high for the resource profile assigned by default to a Smart-1 410. Not sure if spawned postgres processes are getting "stuck" or what (Parent process IDs are not shown in your output) but I'd say a TAC case is definitely in order here as that doesn't look right to me.

Are you using any third-party log analysis tools that might account for the postgres activity?
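Since parent process IDs were missing from the posted output, a follow-up capture could include them and group the postgres processes by parent. This is a minimal sketch, assuming the procps format `ps -eo pid,ppid,pcpu,args` is available; the function name, the paths, and the sample text are hypothetical and made up for illustration.

```python
import os
from collections import defaultdict


def postgres_by_parent(ps_text):
    """Group postgres processes by parent PID.

    Expects text from `ps -eo pid,ppid,pcpu,args` and returns
    {ppid: {"count": n, "cpu": total_percent}}.
    """
    groups = defaultdict(lambda: {"count": 0, "cpu": 0.0})
    for line in ps_text.strip().splitlines()[1:]:   # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        _pid, ppid, pcpu, args = parts
        # Match both "/path/to/postgres -D ..." and "postgres: ..." entries.
        first = os.path.basename(args.split()[0]).rstrip(":")
        if first != "postgres":
            continue
        group = groups[int(ppid)]
        group["count"] += 1
        group["cpu"] += float(pcpu)
    return dict(groups)


# Made-up sample text, purely for illustration.
SAMPLE_PS = """\
  PID  PPID %CPU COMMAND
  900     1  0.5 /usr/local/pgsql/bin/postgres -D /data
  901   900 45.0 postgres: worker one
  902   900 40.0 postgres: worker two
  950     1  3.0 java -jar app.jar
"""

if __name__ == "__main__":
    for ppid, info in sorted(postgres_by_parent(SAMPLE_PS).items()):
        print(f"parent {ppid}: {info['count']} process(es), {info['cpu']:.1f}% CPU")
```

A large and growing count under a single parent would be one hint that spawned backends are not exiting, though it would not confirm why.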


Charles_Palmer · ‎2019-11-25

Unfortunately, I have already tried to address this with TAC twice, and they aren't seeing any problem, or at least don't have any explanation for it. I do not have any third-party log analysis running currently. I do have Indeni doing performance and best-practices monitoring, but it doesn't touch the logs as far as I am aware. Indeni is what tipped me off to the problem initially: on the first three weekends, my CPU was pegged for 45+ minutes out of every hour, which sent Indeni into cooldown monitoring, where it only checks CPU and Memory until the CPU is no longer pegged. I didn't get the same email explosion from Indeni this weekend that I did the previous three weekends (it was emailing me about hourly on the previous weekends, and I only got one notification this month). This tells me that while the CPU is high right now, it isn't staying high for most of the hour like it was before. Maybe whatever it was is settling down now that I am a month into my upgrade from R80.10.

Chris_Atkinson · ‎2019-11-25

If you have the option to run without the Compliance Blade enabled for a period, that is something you could try as a way of isolating the symptoms further.

Additionally there are further SmartEvent CPU optimizations in JHF T107 (ongoing).

If the load persists longer term without resolution you may need to look at distributing roles (SmartEvent) onto other VMs or hardware to alleviate.

Refer also: R80.X Security Management Performance Tuning Guide

