Solved: dmd_mgmt process using several CPUs at 100%

roethlein · 2024-07-11

Lately, on a 16000 appliance running as an SGM, I see several CPUs at 100% for extended periods (sometimes 3 or 4 hours).

In top, the process consuming the most CPU is dmd_mgmt.

I have not observed similar behaviour on other systems, and I was not able to find out what this process actually does.

Can you guys give me a hint?

Thank you

Chris_Atkinson · 2024-07-12

Which JHF is the environment running, and is there anything relevant in the HyperFlow log $FWDIR/log/dmd.elg?

CCSM R77/R80/ELITE

roethlein · 2024-07-12

Hi Chris,

This is R81.20 JHF 41.

I am not sure what relevant information may be hidden in dmd.elg. The log right now, without the high load, looks very similar to the one from yesterday afternoon.

Is there anything in particular I should look for?

Thank you!

PhoneBoy · 2024-07-12

dmd_mgmt is related to HyperFlow (a gateway-side feature): https://support.checkpoint.com/results/sk/sk178070

AmitShmuel · 2024-07-13

From the HyperFlow SK:

When an elephant connection triggers HyperFlow, the output of the "top" and "ps" commands can show that HyperFlow user space processes consume the CPU at 100%.

This occurs because HyperFlow constantly polls its queues to handle incoming jobs. After the elephant connection closes, the output of these commands shows that the user space "us" consumption returns to usual levels because Hyperflow goes down and stops processing jobs.
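To illustrate the polling point above: a worker that busy-polls its queue stays on the CPU even when there is nothing to do, so the kernel charges all of that time as user ("us") time. The Python sketch below is a generic illustration only; it is not HyperFlow code, and the function names `poll_worker` and `blocking_worker` are hypothetical. It compares per-thread CPU time with wall-clock time for a poll-mode loop and a blocking (interrupt-style) loop on the same queue.

```python
import queue
import time


def poll_worker(jobs: queue.Queue, duration: float) -> tuple[float, int]:
    """Poll-mode loop (hypothetical): checks the queue over and over, never sleeps.

    Returns (thread CPU seconds / wall seconds, number of jobs handled).
    """
    handled = 0
    cpu_start, wall_start = time.thread_time(), time.monotonic()
    while time.monotonic() - wall_start < duration:
        try:
            jobs.get_nowait()
            handled += 1  # a real worker would process the job here
        except queue.Empty:
            pass  # queue empty: go straight back and poll again
    wall = time.monotonic() - wall_start
    return (time.thread_time() - cpu_start) / wall, handled


def blocking_worker(jobs: queue.Queue, duration: float) -> tuple[float, int]:
    """Blocking loop (hypothetical): sleeps in get() until a job arrives or time runs out."""
    handled = 0
    cpu_start, wall_start = time.thread_time(), time.monotonic()
    deadline = wall_start + duration
    while (remaining := deadline - time.monotonic()) > 0:
        try:
            jobs.get(timeout=remaining)
            handled += 1
        except queue.Empty:
            break  # timed out with nothing to do
    wall = time.monotonic() - wall_start
    return (time.thread_time() - cpu_start) / wall, handled


if __name__ == "__main__":
    idle_queue: queue.Queue = queue.Queue()
    print("poll mode    : cpu/wall=%.2f jobs=%d" % poll_worker(idle_queue, 0.5))
    print("blocking mode: cpu/wall=%.2f jobs=%d" % blocking_worker(idle_queue, 0.5))
```

On an otherwise idle machine, the poll-mode loop typically reports a CPU/wall ratio close to 1.0 while handling zero jobs, and the blocking loop stays close to 0. That gap is the difference between what top shows and the work actually being done.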

To see the actual load on the CPU, use one of these:

CPView (CPU > Overview > Host)
SNMP (the OID tree 1.3.6.1.4.1.2620.1.6.7.5)
SmartConsole (right-click the Security Gateway object > click Monitor > from the left, open System Counters and click System).

This does not trigger inspection bypass because of a high CPU load.
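For context on why top and ps are misleading here: on Linux, top's per-core percentages are computed from the cumulative time counters in /proc/stat, sampled twice. A thread spinning on an empty queue accumulates user time just like a thread doing real work, so these counters cannot tell the two apart. The sketch below assumes the standard Linux /proc/stat column order (user, nice, system, idle, iowait, irq, softirq, steal) and treats idle + iowait as idle, the usual top convention. It shows what top measures; it is not how CPView computes its Host view.

```python
import os
import time


def parse_proc_stat(text: str) -> dict[str, list[int]]:
    """Return {"cpu0": [user, nice, system, idle, iowait, irq, softirq, steal], ...}."""
    cpus = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the aggregate "cpu" line and non-CPU lines such as "intr" or "ctxt".
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            values = [int(v) for v in fields[1:9]]  # guest columns are already counted in user/nice
            values += [0] * (8 - len(values))  # older kernels report fewer columns
            cpus[fields[0]] = values
    return cpus


def per_cpu_usage(before: str, after: str) -> dict[str, dict[str, float]]:
    """Percent user time ("us") and total busy time per CPU between two snapshots."""
    start, end = parse_proc_stat(before), parse_proc_stat(after)
    usage = {}
    for cpu, counters in end.items():
        if cpu not in start:
            continue
        delta = [a - b for a, b in zip(counters, start[cpu])]
        total = sum(delta)
        if total <= 0:
            continue
        user, idle, iowait = delta[0], delta[3], delta[4]
        usage[cpu] = {
            "us": 100.0 * user / total,
            "busy": 100.0 * (total - idle - iowait) / total,
        }
    return usage


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    with open("/proc/stat") as f:
        first = f.read()
    time.sleep(1)
    with open("/proc/stat") as f:
        second = f.read()
    for cpu, pct in sorted(per_cpu_usage(first, second).items(), key=lambda kv: int(kv[0][3:])):
        print(f"{cpu}: us={pct['us']:.1f}% busy={pct['busy']:.1f}%")
```

A core running a poll-mode thread shows up here with "us" near 100% whether or not that thread is processing any jobs.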

Timothy_Hall · 2024-07-13

Hi @AmitShmuel

But I thought that when HyperFlow becomes active and reallocates former Firewall Worker instances to PPE_MGR or PPE_WT, the reallocated cores were not allowed to be driven beyond 60% utilization by PPE, to ensure that the "mice" connections still remaining on the reallocated cores are not squashed by PPE before they decay away. Questions:

1) So when you say that HyperFlow processes consume CPU at 100%, I assume that load is spread across multiple cores to avoid violating the 60% per-core utilization rule that protects the existing mice connections?

2) I thought that poll mode (instead of interrupt mode), which drives the CPU to 100%, was only employed if UPPAK is enabled on a Lightspeed/9000/19000/29000. Is that not correct?

3) Are the CPAS and PXL "pipeline" paths displayed by fwaccel stats -s just Hyperflow by another name?

Thanks!

Attend my Gateway Performance Optimization R81.20 course
CET (Europe) Timezone Course Scheduled for July 1-2

AmitShmuel · 2024-07-13

Hi Tim,

The 60% utilization enforcement refers only to the non-HyperFlow cores.

For simplicity, let's examine the following 4-core example:

CPU 0: SND
CPU 1: PPE_MGR
CPU 2: PPE_WT
CPU 3: FW_0 FW_1 FW_2

- FW_1 and FW_2 are stopped FW workers (new connections are no longer dispatched to them) that continue to handle existing mice connections; their cores have been reallocated to HyperFlow/PPE threads.

- In top, CPUs 1 and 2 will show 100% utilization because they are constantly polling for MD5 and hash jobs, similar to how UPPAK polls for packets. Both the UPPAK (usim_x86) and HyperFlow (dmd_run) processes run in poll mode and are considered "PMD" (poll mode driver). Their real utilization can be seen in CPView.

- Yes, the "pipeline" paths refer to HyperFlow.
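The thread does not describe how CPView derives the real utilization of PMD threads. As a purely hypothetical model of the idea, a poll loop can time only the intervals it spends handling jobs and report that share of its running time, ignoring empty polls. The names `PollStats` and `run_poll_loop` and the simulated job handler are illustrative assumptions, not Check Point code.

```python
import queue
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class PollStats:
    """Hypothetical accounting for a poll-mode worker thread."""
    work_time: float = 0.0   # seconds spent inside the job handler
    total_time: float = 0.0  # seconds the loop ran (what top sees as ~100% busy)
    jobs: int = 0

    @property
    def real_utilization(self) -> float:
        return self.work_time / self.total_time if self.total_time else 0.0


def run_poll_loop(jobs: queue.Queue, handler: Callable[[Any], None], duration: float) -> PollStats:
    stats = PollStats()
    start = time.perf_counter()
    while time.perf_counter() - start < duration:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            continue  # empty poll: burns CPU but is not counted as work
        t0 = time.perf_counter()
        handler(job)
        stats.work_time += time.perf_counter() - t0
        stats.jobs += 1
    stats.total_time = time.perf_counter() - start
    return stats


if __name__ == "__main__":
    q: queue.Queue = queue.Queue()
    for n in range(10):
        q.put(n)
    # Simulated job cost of 10 ms each (an assumption for the demo).
    s = run_poll_loop(q, lambda _job: time.sleep(0.01), duration=0.5)
    print(f"jobs={s.jobs} real utilization={s.real_utilization:.0%}")
```

With ten simulated 10 ms jobs in a 0.5 s window, the loop would appear fully busy in top, while its work share is only roughly 20%.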

Timothy_Hall · 2024-07-14

Thank you for the clarifications.

roethlein · 2024-07-14

Thank you, this really helped me understand the situation better.

So, as I understand it, this behaviour is pretty normal, and we should find out which connections are the elephants.