Hi there, I thought it might be interesting to see what the community thinks about this. I am on R81.20 with Jumbo Take 98.
I have been battling CPU spikes on a fairly heavily utilized 7K appliance for some time, and I am trying to figure out how to track them down and get a feel for handling some problem traffic flows. I have 27 cores running at ~35% throughout the business day. I am running a lot of blades on this gateway (screenshots attached) and using the Optimized inspection profile.
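For reference, the per-worker picture I'm describing comes from the standard tools that ship with Gaia (nothing exotic):

fw ctl affinity -l -a     # shows which cores the SNDs and fw workers are pinned to
fw ctl multik stat        # per-instance connection counts and peak load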
For example, I am seeing entries like these in /var/log/spike_detective.log today:
spike info: type: thread, thread id: 42026, thread name: fwk0_26, start time: 16/04/25 08:42:20, spike duration (sec): 942, initial cpu usage: 100, average cpu usage: 100, perf taken: 0
spike info: type: thread, thread id: 42795, thread name: fw_full, start time: 16/04/25 09:58:38, spike duration (sec): 5, initial cpu usage: 100, average cpu usage: 100, perf taken: 0
spike info: type: cpu, cpu core: 9, top consumer: fwk0_12, start time: 16/04/25 13:12:29, spike duration (sec): 18, initial cpu usage: 100, average cpu usage: 87, perf taken: 1
spike info: type: thread, thread id: 42019, thread name: fwk0_19, start time: 16/04/25 13:12:47, spike duration (sec): 1392, initial cpu usage: 100, average cpu usage: 100, perf taken: 1
spike info: type: thread, thread id: 42025, thread name: fwk0_25, start time: 16/04/25 13:35:53, spike duration (sec): 1119, initial cpu usage: 100, average cpu usage: 100, perf taken: 0
So looking at the fwk0_25 entry, it's telling me that this worker instance was pegged at 100 percent for close to 20 minutes today.
I ran fw ctl multik print_heavy_conn to dig further and found the following for fw_25, focusing on the 13:35:53 timestamp:
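(To pull just that instance out of the heavy-connection output, I simply filtered it; the grep below is only my quick-and-dirty approach, nothing official:)

fw ctl multik print_heavy_conn | grep "fw_25"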
[fw_25]; Conn: x.x.65.57:53940 -> x.x.69.96:1234 IPP 6; Instance load: 61%; Connection instance load: 76%; StartTime: 16/04/25 13:14:14; Duration: 70; IdentificationTime: 16/04/25 13:14:21; Service: 6:1234; Total Packets: 3025234; Total Bytes: 4357408656; Num of PM jobs: 0; Num of MD5 jobs: 0; Num of buffers sent to Main: 0; PPE Heavy Duration: 0; Async Duration: 0; Async Bytes: 0; Async Transitions: 0; Async Avg Load: 0%; Sync Avg Load: 67%;
[fw_25]; Conn: x.x.64.27:61924 -> x.x.69.96:1234 IPP 6; Instance load: 63%; Connection instance load: 73%; StartTime: 16/04/25 13:35:46; Duration: 23; IdentificationTime: 16/04/25 13:35:49; Service: 6:1234; Total Packets: 1813629; Total Bytes: 2564973893; Num of PM jobs: 1351397; Num of MD5 jobs: 0; Num of buffers sent to Main: 0; PPE Heavy Duration: 20; Async Duration: 17; Async Bytes: 2033686855; Async Transitions: 0; Async Avg Load: 84%; Sync Avg Load: 78%;
[fw_25]; Conn: x.x.47.75:47610 -> x.x.3.180:80 IPP 6; Instance load: 68%; Connection instance load: 94%; StartTime: 15/04/25 21:38:45; Duration: 1; IdentificationTime: 15/04/25 21:38:49; Service: 6:80; Total Packets: 44301; Total Bytes: 61654694; Num of PM jobs: 0; Num of MD5 jobs: 0; Num of buffers sent to Main: 0; PPE Heavy Duration: 0; Async Duration: 0; Async Bytes: 0; Async Transitions: 0; Async Avg Load: 0%; Sync Avg Load: 68%;
[fw_25]; Conn: x.x.64.27:61944 -> x.x.69.96:1234 IPP 6; Instance load: 61%; Connection instance load: 74%; StartTime: 16/04/25 13:36:24; Duration: 14; IdentificationTime: 16/04/25 13:36:27; Service: 6:1234; Total Packets: 1110687; Total Bytes: 1570905924; Num of PM jobs: 1045543; Num of MD5 jobs: 0; Num of buffers sent to Main: 0; PPE Heavy Duration: 13; Async Duration: 13; Async Bytes: 1484700978; Async Transitions: 0; Async Avg Load: 83%; Sync Avg Load: 61%;
[fw_25]; Conn: x.x.65.57:54293 -> x.x.69.96:1234 IPP 6; Instance load: 61%; Connection instance load: 72%; StartTime: 16/04/25 13:23:08; Duration: 27; IdentificationTime: 16/04/25 13:23:14; Service: 6:1234; Total Packets: 1093095; Total Bytes: 1573465832; Num of PM jobs: 0; Num of MD5 jobs: 0; Num of buffers sent to Main: 0; PPE Heavy Duration: 0; Async Duration: 0; Async Bytes: 0; Async Transitions: 0; Async Avg Load: 0%; Sync Avg Load: 67%;
I attached a screenshot of CPView during that window to see what the CPU usage looked like, and it seems that CPU 25 is actually healthy? I also went into the I/O view (see attached), and as I understand it, it's showing some crazy-high request rates on the file system.
I also took a screenshot of the I/O view in between the spikes below, around 11:00 AM today, and the I/O is drastically different by comparison. I attached what the packet distribution looks like as well: normally I see a healthy accelerated ratio of around 95/5, but something is driving the average F2F share up to around 80/20, which is not ideal (a note on where that ratio comes from is after the spike entries below).
spike info: type: cpu, cpu core: 24, top consumer: fwk0_13, start time: 16/04/25 09:40:00, spike duration (sec): 6, initial cpu usage: 83, average cpu usage: 83, perf taken: 1
spike info: type: cpu, cpu core: 31, top consumer: fw_full, start time: 16/04/25 09:58:38, spike duration (sec): 11, initial cpu usage: 85, average cpu usage: 74, perf taken: 1
spike info: type: cpu, cpu core: 7, top consumer: fwk0_16, start time: 16/04/25 13:07:03, spike duration (sec): 70, initial cpu usage: 85, average cpu usage: 80, perf taken: 0
spike info: type: cpu, cpu core: 8, top consumer: fwk0_14, start time: 16/04/25 13:08:42, spike duration (sec): 41, initial cpu usage: 84, average cpu usage: 57, perf taken: 0
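For context on the accelerated/F2F split I mentioned above: I'm reading it off the SecureXL statistics, and the same numbers should show up with:

fwaccel stats -s     # summary view with accelerated vs. F2F packet percentages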
Some days/weeks are fine, and at other times users will randomly have a lot of issues. I believe it's tied to what users are or aren't doing, and to the unique traffic patterns on our network.
So, looking at this flow:
[fw_25]; Conn: x.x.64.27:61924 -> x.x.69.96:1234 IPP 6; Instance load: 63%; Connection instance load: 73%; StartTime: 16/04/25 13:35:46; Duration: 23; IdentificationTime: 16/04/25 13:35:49; Service: 6:1234; Total Packets: 1813629; Total Bytes: 2564973893; Num of PM jobs: 1351397; Num of MD5 jobs: 0; Num of buffers sent to Main: 0; PPE Heavy Duration: 20; Async Duration: 17; Async Bytes: 2033686855; Async Transitions: 0; Async Avg Load: 84%; Sync Avg Load: 78%;
So am I correct in saying that in 23 seconds I saw almost 2 million packets in a roughly 2.5 GB flow? Is it a reasonable assumption that this is mostly a big I/O issue?
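Back-of-the-envelope on that one connection (my own math, not from any tool):

1,813,629 packets / 23 sec ≈ 79,000 packets per second
2,564,973,893 bytes / 23 sec ≈ 111 MB/s ≈ 0.9 Gbps, sustained on a single worker instance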
I think I can just bypass inspection via fast_accel and/or re-route this traffic around the firewall.
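If I go the fast_accel route, my understanding from the docs (and I may be off on the exact syntax, so treat this as a sketch) is that the rule for this flow would look something like:

fw ctl fast_accel enable
fw ctl fast_accel add x.x.64.27 x.x.69.96 1234 6     # source, destination, destination port, protocol (6 = TCP)
fw ctl fast_accel show_table     # confirm the rule shows up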
I wanted to see what people smarter than I am think about this: am I on the right track, or is there a more economical way to go about it? Right now it feels like a big game of whack-a-mole. fast_accel is not meant to be used liberally and is limited to 24 rules, and I am not sure that I wouldn't see the same issue with just the Firewall blade, so would I have to re-route this traffic around the firewall anyway? I have even considered disabling blades to help lower the consumption.