Hi there, I thought it might be interesting to see what the community thinks about this. I am on R81.20 with Jumbo Take 98.
I have been battling CPU spikes on a fairly heavily utilized 7K appliance for some time, and I am trying to figure out how to track them down and get a feel for handling some problem traffic flows. I have 27 cores running at ~35% throughout the business day. I am running a lot of blades on this gateway (screenshots attached) and using the Optimized inspection profile.
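For reference, the per-worker picture I'm describing comes from the standard tools that ship with Gaia (nothing exotic):

fw ctl affinity -l -a     # shows which cores the SNDs and fw workers are pinned to
fw ctl multik stat        # per-instance connection counts and peak load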
For example, I am seeing entries like these in /var/log/spike_detective.log today:
spike info: type: thread, thread id: 42026, thread name: fwk0_26, start time: 16/04/25 08:42:20, spike duration (sec): 942, initial cpu usage: 100, average cpu usage: 100, perf taken: 0
spike info: type: thread, thread id: 42795, thread name: fw_full, start time: 16/04/25 09:58:38, spike duration (sec): 5, initial cpu usage: 100, average cpu usage: 100, perf taken: 0
spike info: type: cpu, cpu core: 9, top consumer: fwk0_12, start time: 16/04/25 13:12:29, spike duration (sec): 18, initial cpu usage: 100, average cpu usage: 87, perf taken: 1
spike info: type: thread, thread id: 42019, thread name: fwk0_19, start time: 16/04/25 13:12:47, spike duration (sec): 1392, initial cpu usage: 100, average cpu usage: 100, perf taken: 1
spike info: type: thread, thread id: 42025, thread name: fwk0_25, start time: 16/04/25 13:35:53, spike duration (sec): 1119, initial cpu usage: 100, average cpu usage: 100, perf taken: 0
So looking at the fwk0_25 entry, it's telling me that this worker instance was pegged at 100 percent for close to 20 minutes today.
I ran fw ctl multik print_heavy_conn to dig further and found the following for fw_25, focusing on the 13:35:53 timestamp:
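(To pull just that instance out of the heavy-connection output, I simply filtered it; the grep below is only my quick-and-dirty approach, nothing official:)

fw ctl multik print_heavy_conn | grep "fw_25"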
[fw_25]; Conn: x.x.65.57:53940 -> x.x.69.96:1234 IPP 6; Instance load: 61%; Connection instance load: 76%; StartTime: 16/04/25 13:14:14; Duration: 70; IdentificationTime: 16/04/25 13:14:21; Service: 6:1234; Total Packets: 3025234; Total Bytes: 4357408656; Num of PM jobs: 0; Num of MD5 jobs: 0; Num of buffers sent to Main: 0; PPE Heavy Duration: 0; Async Duration: 0; Async Bytes: 0; Async Transitions: 0; Async Avg Load: 0%; Sync Avg Load: 67%;
[fw_25]; Conn: x.x.64.27:61924 -> x.x.69.96:1234 IPP 6; Instance load: 63%; Connection instance load: 73%; StartTime: 16/04/25 13:35:46; Duration: 23; IdentificationTime: 16/04/25 13:35:49; Service: 6:1234; Total Packets: 1813629; Total Bytes: 2564973893; Num of PM jobs: 1351397; Num of MD5 jobs: 0; Num of buffers sent to Main: 0; PPE Heavy Duration: 20; Async Duration: 17; Async Bytes: 2033686855; Async Transitions: 0; Async Avg Load: 84%; Sync Avg Load: 78%;
[fw_25]; Conn: x.x.47.75:47610 -> x.x.3.180:80 IPP 6; Instance load: 68%; Connection instance load: 94%; StartTime: 15/04/25 21:38:45; Duration: 1; IdentificationTime: 15/04/25 21:38:49; Service: 6:80; Total Packets: 44301; Total Bytes: 61654694; Num of PM jobs: 0; Num of MD5 jobs: 0; Num of buffers sent to Main: 0; PPE Heavy Duration: 0; Async Duration: 0; Async Bytes: 0; Async Transitions: 0; Async Avg Load: 0%; Sync Avg Load: 68%;
[fw_25]; Conn: x.x.64.27:61944 -> x.x.69.96:1234 IPP 6; Instance load: 61%; Connection instance load: 74%; StartTime: 16/04/25 13:36:24; Duration: 14; IdentificationTime: 16/04/25 13:36:27; Service: 6:1234; Total Packets: 1110687; Total Bytes: 1570905924; Num of PM jobs: 1045543; Num of MD5 jobs: 0; Num of buffers sent to Main: 0; PPE Heavy Duration: 13; Async Duration: 13; Async Bytes: 1484700978; Async Transitions: 0; Async Avg Load: 83%; Sync Avg Load: 61%;
[fw_25]; Conn: x.x.65.57:54293 -> x.x.69.96:1234 IPP 6; Instance load: 61%; Connection instance load: 72%; StartTime: 16/04/25 13:23:08; Duration: 27; IdentificationTime: 16/04/25 13:23:14; Service: 6:1234; Total Packets: 1093095; Total Bytes: 1573465832; Num of PM jobs: 0; Num of MD5 jobs: 0; Num of buffers sent to Main: 0; PPE Heavy Duration: 0; Async Duration: 0; Async Bytes: 0; Async Transitions: 0; Async Avg Load: 0%; Sync Avg Load: 67%;
I attached a screenshot of CPView during that window to see what the CPU usage looked like, and it seems that CPU 25 is actually healthy? I also went into the I/O view (see attached), and as I understand it, it's showing some crazy-high request rates on the file system.
I also took a screenshot of the I/O view in between the spikes below, around 11:00 AM today, and the I/O is drastically different by comparison. I attached what the packet distribution looks like as well: normally I see a healthy accelerated ratio of around 95/5, but something is driving the average F2F share up to around 80/20, which is not ideal (a note on where that ratio comes from is after the spike entries below).
spike info: type: cpu, cpu core: 24, top consumer: fwk0_13, start time: 16/04/25 09:40:00, spike duration (sec): 6, initial cpu usage: 83, average cpu usage: 83, perf taken: 1
spike info: type: cpu, cpu core: 31, top consumer: fw_full, start time: 16/04/25 09:58:38, spike duration (sec): 11, initial cpu usage: 85, average cpu usage: 74, perf taken: 1
spike info: type: cpu, cpu core: 7, top consumer: fwk0_16, start time: 16/04/25 13:07:03, spike duration (sec): 70, initial cpu usage: 85, average cpu usage: 80, perf taken: 0
spike info: type: cpu, cpu core: 8, top consumer: fwk0_14, start time: 16/04/25 13:08:42, spike duration (sec): 41, initial cpu usage: 84, average cpu usage: 57, perf taken: 0
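For context on the accelerated/F2F split I mentioned above: I'm reading it off the SecureXL statistics, and the same numbers should show up with:

fwaccel stats -s     # summary view with accelerated vs. F2F packet percentages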
Some days/weeks are fine, and at other times users will randomly have a lot of issues. I believe it's tied to what users are or aren't doing, and to the unique traffic patterns on our network.
So, looking at this flow:
[fw_25]; Conn: x.x.64.27:61924 -> x.x.69.96:1234 IPP 6; Instance load: 63%; Connection instance load: 73%; StartTime: 16/04/25 13:35:46; Duration: 23; IdentificationTime: 16/04/25 13:35:49; Service: 6:1234; Total Packets: 1813629; Total Bytes: 2564973893; Num of PM jobs: 1351397; Num of MD5 jobs: 0; Num of buffers sent to Main: 0; PPE Heavy Duration: 20; Async Duration: 17; Async Bytes: 2033686855; Async Transitions: 0; Async Avg Load: 84%; Sync Avg Load: 78%;
So am I correct in saying that in 23 seconds I saw almost 2 million packets in a roughly 2.5 GB flow? Is it a reasonable assumption that this is mostly a big I/O issue?
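Back-of-the-envelope on that one connection (my own math, not from any tool):

1,813,629 packets / 23 sec ≈ 79,000 packets per second
2,564,973,893 bytes / 23 sec ≈ 111 MB/s ≈ 0.9 Gbps, sustained on a single worker instance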
I think I can just bypass inspection via fast_accel and/or re-route this traffic around the firewall.
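If I go the fast_accel route, my understanding from the docs (and I may be off on the exact syntax, so treat this as a sketch) is that the rule for this flow would look something like:

fw ctl fast_accel enable
fw ctl fast_accel add x.x.64.27 x.x.69.96 1234 6     # source, destination, destination port, protocol (6 = TCP)
fw ctl fast_accel show_table     # confirm the rule shows up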
I wanted to see what people smarter than I am think about this: am I on the right track, or is there a more economical way to go about it? Right now it feels like a big game of whack-a-mole. fast_accel is not meant to be used liberally and is limited to 24 rules, and I am not sure that I wouldn't see the same issue with just the Firewall blade, so would I have to re-route this traffic around the firewall anyway? I have even considered disabling blades to help lower the consumption.