R80.40 management server performance issues
All,
Since we did an in-place upgrade from R80.10 to R80.40 JHF Take 83 on our management server, we have been seeing decreased performance. The issue seems to be across the board: both SmartConsole and SSH sessions have been very laggy since the upgrade. We did reboot the management server after the upgrade; it seemed fine for a couple of days but has gotten worse since. I can see that Java is taking up a ton of CPU (see the attached top output). Has anyone seen or experienced this? I do have a TAC case opened, but wanted to check whether anyone out there has run into this.
Thanks,
Bill
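For anyone gathering the same evidence: a batch-mode capture is easier to attach to a case than a screenshot. Something along these lines should work from expert mode on Gaia, assuming the stock procps top and ps (the exact flags here are illustrative, not from the original post):
# top -b -n 1 | head -25
# ps -eo pid,ni,pcpu,pmem,etime,comm --sort=-pcpu | head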
---
The process that is taking up all the CPU is niced, so it will use CPU only when available.
I suspect the indexing going on is related to the upgrade and will subside shortly.
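A quick way to verify the nice value for yourself, assuming the standard procps ps on Gaia (a positive NI value means the scheduler deprioritizes the process in favor of anything running at normal priority):
# ps -eo pid,ni,pri,pcpu,comm --sort=-pcpu | head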
---
Hi Bill,
Can you please send us your 'SmartEventCollectLogs' output, or point us to the TAC case with this info?
---
Hi Dror,
Pardon my ignorance. How do I get the SmartEventCollectLogs output? Would I need to run 'SmartEventSetDebugLevel all trace' first, and if so, for how long?
---
No.
Once the load is high again, simply run it as-is from the expert-mode CLI:
SmartEventCollectLogs
and attach the output here and to the TAC ticket.
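For anyone following along, the full sequence from a Gaia clish login would presumably be just the following (expert is the standard Gaia command to drop into the expert-mode shell; the collector script takes no arguments per the note above):
> expert
# SmartEventCollectLogs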
---
As Phoneboy said, the indexing processes run with minimum CPU and I/O priority and will get shoved out of the way when other work needs to get done. However, the very high wio (I/O wait) values on some cores and not others are a bit concerning. Are you using RAID for your disks? Is the RAID array in an Optimal state? (Use the expert-mode raid_diagnostic command.)
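A minimal sketch of how to check both points: raid_diagnostic is the command named above, while vmstat and top are the stock Linux tools on a Gaia open server (the 'wa' column in vmstat is I/O wait; pressing 1 inside top shows the per-core breakdown):
# raid_diagnostic
# vmstat 5 5
# top    (then press 1 for the per-core view)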
You may want to check this recent SK, which is for gateways but sounds eerily similar to your situation: sk170560: High CPU, high IOWait utilization on random CPUs, and delayed CLI outputs on various comma...
now available at maxpowerfirewalls.com
---
Hi Phoneboy and Tim,
Thanks for your inputs. I will check out the SK as well. I checked the RAID on the disks: the RAID-5 state is Optimal and all drives checked out fine. Our management server is an open server doing double duty as a management server and a log server at the same time. Our logging partition is 8 TB, of which about 6 TB is consumed. The servers were upgraded about two weeks ago; a reboot seems to help for several days afterward. We recently rebooted it and are keeping an eye on it for performance issues.
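If it helps to double-check those partition figures, the stock tools will confirm them; using $FWDIR/log as the location of the firewall logs is an assumption here, as the layout varies by open-server install:
# df -h
# du -sh $FWDIR/log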
---
Hmm, even with the RAID in an Optimal state, it is starting to smell like your disk path is a bit oversubscribed. Please post the output of these two commands to show the logging and indexing load:
cpstat mg -f log_server
cpstat mg -f indexer
now available at maxpowerfirewalls.com
---
# cpstat mg -f log_server
Log Receive Rate: 9355
Log Receive Rate Peak: 61544
Log Receive Rate Last 10 Minutes: 13169
Log Receive Rate Last Hour: 12465
---
We were hit by the fix 'NEW: Solr server process is restarted automatically if it is not responsive for a long time.' in R80.30 Take 219 and had to downgrade to Take 217.
We saw the same issue you are describing. The same "fix" was implemented in R80.40 Take 78, so perhaps you are hit by the same issue? Tail cpm.elg and you will see it crashing constantly.
The issue is fixed in Take 87 (see sk170634), but Take 87 only just went GA, so I wouldn't recommend it yet.
Best regards,
Henrik
---
Hi Henrik,
I did the tail on cpm.elg and can see Solr restarting. I will post this info in our case as well.
tail -f cpm.elg | grep -E 'Stopping|Starting'
25/11/20 07:48:42,784 INFO fts.solr.SolrServerRunner [qtp-536021905-133083]: Stopping Solr with $MDS_TEMPLATE/scripts/solr_stop.sh script
25/11/20 07:48:42,787 INFO fts.solr.SolrServerRunner [qtp-536021905-133083]: Starting Solr server with command: /opt/CPshrd-R80.40/jre_64/bin/java -D_CPM_SOLR=TRUE -Xmx8192m -Xms64m -Xgcpolicy:optavgpause -Djava.io.tmpdir=/opt/CPsuite-R80.40/fw1/tmp -Xaggressive -Xshareclasses:none -Xdump:heap:events=gpf+user -Xdump:directory=/var/log/dump/usermode -Xdump:tool:none -Xdump:tool:events=gpf+abort+traceassert+corruptcache,priority=1,range=1..0,exec=javaCompress.sh CPM_SOLR %pid -Xdump:tool:events=systhrow,filter=java/lang/OutOfMemoryError,priority=1,range=1..0,exec=javaCompress.sh CPM_SOLR %pid -Xdump:tool:events=throw,filter=java/lang/OutOfMemoryError,priority=1,exec=kill -9 %pid -Dsolr.solr.home=/opt/CPsuite-R80.40/fw1/Solr/solr/ -DNGM.SOLR.LOG.DIR=/opt/CPsuite-R80.40/fw1/log -Djava.util.logging.config.file=/opt/CPsuite-R80.40/fw1/Solr/etc/logging.properties -DSTART=/opt/CPsuite-R80.40/fw1/Solr/start.config -Djetty.home=/opt/CPsuite-R80.40/fw1/Solr/ -DSTOP.KEY=checkpointkey -DSTOP.PORT=8982 -Dpath=/opt/CPsuite-R80.40/fw1/cpm-server/java_is.jar:/opt/CPsuite-R80.40/fw1/cpm-server/java_sic.jar:/opt/CPshrd-R80.40/jars/jetty_assist.jar -jar /opt/CPsuite-R80.40/fw1/Solr/start.jar
25/11/20 07:49:06,916 INFO fts.solr.SolrServerRunner [qtp-536021905-206009]: Stopping Solr with $MDS_TEMPLATE/scripts/solr_stop.sh script
25/11/20 07:49:06,918 INFO fts.solr.SolrServerRunner [qtp-536021905-206009]: Starting Solr server with command: [same java command line as above]
25/11/20 07:49:14,937 INFO fts.solr.SolrServerRunner [qtp-536021905-21461]: Stopping Solr with $MDS_TEMPLATE/scripts/solr_stop.sh script
25/11/20 07:49:14,940 INFO fts.solr.SolrServerRunner [qtp-536021905-21461]: Starting Solr server with command: [same java command line as above]
25/11/20 07:49:47,380 INFO fts.solr.SolrServerRunner [qtp-536021905-176307]: Stopping Solr with $MDS_TEMPLATE/scripts/solr_stop.sh script
25/11/20 07:49:47,383 INFO fts.solr.SolrServerRunner [qtp-536021905-176307]: Starting Solr server with command: [same java command line as above]
25/11/20 07:49:48,261 INFO fts.solr.SolrServerRunner [qtp-536021905-161867]: Stopping Solr with $MDS_TEMPLATE/scripts/solr_stop.sh script
25/11/20 07:49:48,263 INFO fts.solr.SolrServerRunner [qtp-536021905-161867]: Starting Solr server with command: [same java command line as above]
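To put a number on how often Solr is bouncing, a simple count over the same file works; $FWDIR/log/cpm.elg as the path is an assumption based on the standard management log directory:
# grep -c 'Starting Solr server' $FWDIR/log/cpm.elg
# grep 'Stopping Solr' $FWDIR/log/cpm.elg | tail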
---
Your log rate is quite high: ~10,000 logs/sec with peaks of ~60,000 logs/sec.
Please send the cpm.elg files and the 'SmartEventCollectLogs' output here (what is the TAC ticket #?).
Also, is it still happening consistently, both the CPM Solr restarts and the high load causing slowness on the management server?
---
Hi Dror,
I have uploaded today's cpm.elg and today's SmartEventCollectLogs output to our case, in the incoming folder. The case is 6-0002423430. The load may not be as high since we are on holiday today.
---
I cannot find your case number for some reason.
If the load is higher today, please re-generate and re-upload.
Please also send it directly to my email: drora@checkpoint.com.
Thanks.
---
The load is higher today. I am in the process of getting the info to you. I will upload it to the case again and email it to you as well once completed.