Michel_B
Participant

Troubleshooting performance issues

I'm having issues with a 4200 appliance and its performance. These issues have been going on for a while, but they have become quite disruptive lately.

The setup is a 4200 appliance running R80.10 (but also R77.30 had these issues), with only the Firewall, VPN, Identity Awareness, Application & URL filtering blades enabled. The problem is, I have no clue as to what's causing this and my troubleshooting skills are not up to par. I hope you can give me a clue.

More often than not, I see the gateway's CPU peak at 99%. Sometimes, when I check the top connections with cpview, I see a client downloading a file over HTTPS (inspection not enabled) at 60 Mbit/s over our WAN connection. In my opinion that shouldn't max out a gateway, but I can at least understand it. Other times there is no visible cause: I hardly see any traffic in cpview, yet 'top' gives me output like the one below.

Top overview

This is not always the same. Sometimes you see a fw_worker, cpd or pdpd as the #1 CPU user.

fwaccel stat

Accelerator Status : on
Accept Templates : disabled by Firewall
Layer Network disables template offloads from rule #174
Throughput acceleration still enabled.
Drop Templates : enabled
NAT Templates : disabled by Firewall
Layer Network disables template offloads from rule #174
Throughput acceleration still enabled.
NMR Templates : enabled
NMT Templates : enabled

Accelerator Features : Accounting, NAT, Cryptography, Routing,
HasClock, Templates, Synchronous, IdleDetection,
Sequencing, TcpStateDetect, AutoExpire,
DelayedNotif, TcpStateDetectV2, CPLS, McastRouting,
WireMode, DropTemplates, NatTemplates,
Streaming, MultiFW, AntiSpoofing, Nac,
ViolationStats, AsychronicNotif, ERDOS,
McastRoutingV2, NMR, NMT, NAT64, GTPAcceleration,
SCTPAcceleration
Cryptography Features : Tunnel, UDPEncapsulation, MD5, SHA1, NULL,
3DES, DES, CAST, CAST-40, AES-128, AES-256,
ESP, LinkSelection, DynamicVPN, NatTraversal,
EncRouting, AES-XCBC, SHA256

fwaccel stats -s

Accelerated conns/Total conns : 243/2697 (9%)
Delayed conns/(Accelerated conns + PXL conns) : 70/1516 (4%)
Accelerated pkts/Total pkts : 170686/2775658 (6%)
F2Fed pkts/Total pkts : 196461/2775658 (7%)
PXL pkts/Total pkts : 2408511/2775658 (86%)
QXL pkts/Total pkts : 0/2775658 (0%)

fwaccel stats -p

F2F packets:
--------------
Violation               Packets   Violation               Packets
--------------------  ---------   --------------------  ---------
pkt is a fragment           470   pkt has IP options            0
ICMP miss conn             1292   TCP-SYN miss conn         20969
TCP-other miss conn        3018   UDP miss conn             19495
other miss conn               0   VPN returned F2F              0
ICMP conn is F2Fed         9904   TCP conn is F2Fed        121746
UDP conn is F2Fed         18277   other conn is F2Fed           0
uni-directional viol          0   possible spoof viol           0
TCP state viol             2785   out if not def/accl         882
bridge, src=dst               0   routing decision err       1550
sanity checks failed          0   temp conn expired             0
fwd to non-pivot              0   broadcast/multicast           0
cluster message               0   partial conn               1576
PXL returned F2F          11879   cluster forward               0
chain forwarding              0   Tmpl no-match range           5
Tmpl no-match time            0   general reason                6
route change                  0   inbound zone change           0
outbound zone change          0

I have cleaned up my rulebase as much as I possibly can right now. Because of the recent upgrade from R77.30 to R80.10 I haven't been able to convert my rulebase to a layered one yet.

How can I find out what's causing these issues?

9 Replies
JozkoMrkvicka
Mentor

How many rules do you have on this gateway?

What is rule #174, the one that is disabling acceleration?

Did you try temporarily turning off IA or URL Filtering to see whether that points to the root cause?

Kind regards,
Jozko Mrkvicka
Kaspars_Zibarts
Employee

From memory, the 4200 has only two CPU cores. Doing simple maths from your screenshot, the usage adds up to 165%. The screenshot does not show soft interrupts for SXL, which would take some CPU too. I'm just guessing, but your CPUs look maxed out. Check with top (option 1) so you can see the utilisation of both cores in detail when it happens. How much idle time do you see left on each?
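If watching top interactively is not practical, the per-core idle time can also be derived from two snapshots of /proc/stat. The following is only a minimal sketch, assuming a Linux host with a Python 3 interpreter (not verified on Gaia itself); the function names are made up for illustration.

```python
import time

def parse_proc_stat(text):
    """Return {core: (idle_ticks, total_ticks)} for per-core lines (cpu0, cpu1, ...)."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the aggregate "cpu" line and anything that is not a per-core line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # First eight counters: user nice system idle iowait irq softirq steal.
        values = [int(v) for v in fields[1:9]]
        cores[fields[0]] = (values[3], sum(values))
    return cores

def idle_percent(before, after):
    """Idle percentage per core between two parse_proc_stat() snapshots."""
    result = {}
    for core, (idle1, total1) in after.items():
        if core not in before:
            continue
        idle0, total0 = before[core]
        elapsed = total1 - total0
        result[core] = 100.0 * (idle1 - idle0) / elapsed if elapsed > 0 else 0.0
    return result

def sample_idle(interval_s=1.0, path="/proc/stat"):
    """Take two snapshots interval_s apart and return the idle % per core."""
    with open(path) as fh:
        before = parse_proc_stat(fh.read())
    time.sleep(interval_s)
    with open(path) as fh:
        after = parse_proc_stat(fh.read())
    return idle_percent(before, after)
```

Calling sample_idle() in a loop and logging the result gives a rough per-core history. A core that stays close to 0% idle during the slowdowns would support the "maxed out" guess.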

As suggested, try turning off the advanced blades (URLF and AC) that push traffic to the medium path (PXL is 86% in your stats).
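To compare the path split before and after disabling a blade, the `fwaccel stats -s` summary can be parsed into numbers. This is a rough sketch that assumes the output keeps the "label : n/total (x%)" layout shown in the original post; the parser name is made up for illustration.

```python
import re

# Matches lines such as "PXL pkts/Total pkts : 2408511/2775658 (86%)".
LINE_RE = re.compile(r"^\s*(?P<label>.+?)\s*:\s*(?P<num>\d+)/(?P<den>\d+)")

def parse_fwaccel_summary(text):
    """Return {label: (numerator, denominator, percent)} for each ratio line."""
    result = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        num, den = int(m.group("num")), int(m.group("den"))
        pct = 100.0 * num / den if den else 0.0
        result[m.group("label")] = (num, den, pct)
    return result

# Sample taken from the output posted at the top of this thread.
SAMPLE = """\
Accelerated conns/Total conns : 243/2697 (9%)
Delayed conns/(Accelerated conns + PXL conns) : 70/1516 (4%)
Accelerated pkts/Total pkts : 170686/2775658 (6%)
F2Fed pkts/Total pkts : 196461/2775658 (7%)
PXL pkts/Total pkts : 2408511/2775658 (86%)
QXL pkts/Total pkts : 0/2775658 (0%)
"""

if __name__ == "__main__":
    for label, (num, den, pct) in parse_fwaccel_summary(SAMPLE).items():
        print(f"{label:<48} {pct:5.1f}%")
```

If the PXL share drops noticeably after turning a blade off, that blade is a likely contributor to the medium-path load.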

Michel_B
Participant

Jozko Mrkvicka, rule #174 is one of the last rules before the default deny. It consists of some RPC services.

Kaspars Zibarts, I'm also guessing the CPU is simply maxed out here. That would explain the high CPU percentages for the "less obvious" processes, I guess?

Turning off AP and IA is not really an option right now, as this has too much impact on the environment. I could try to disable URLF.

Kaspars_Zibarts
Employee

You can see that SXL (traffic arriving on interfaces) is chewing up a lot of CPU too, so you don't have much idle time left there. My guess is that these gateways would struggle to cope with the slightest traffic peak because of the shortage of CPU time. You need to get bigger boxes. You also mentioned that you might be running R80.10 on those. I know the 4000 series is supported, but bear in mind that R80.10 is much more CPU and memory hungry compared to R77.30.

Do you have any monitoring tools in place to see CPU history per core? Else just check manually (if possible) when it becomes unstable / unresponsive - see how much idle time is left.

JozkoMrkvicka
Mentor

There are a couple of OIDs available for checking CPU via SNMP (per core / overall).

Once you have the OID, you can create a script which runs snmpwalk against the desired OID every XY seconds/minutes and saves the output to a file.
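Such a polling script could look roughly like the sketch below. It assumes net-snmp's snmpwalk is installed on the monitoring host and that SNMP v2c with a community string is configured on the gateway. The function names, log file name and example address are made up for illustration, and the OID itself has to come from the articles listed below.

```python
import subprocess
import time
from datetime import datetime

def build_snmpwalk_cmd(host, community, oid):
    """Command line for net-snmp's snmpwalk using SNMP v2c."""
    return ["snmpwalk", "-v", "2c", "-c", community, host, oid]

def format_record(timestamp, output):
    """Prefix each line of snmpwalk output with an ISO 8601 timestamp."""
    stamp = timestamp.isoformat(timespec="seconds")
    return "".join(f"{stamp} {line}\n" for line in output.splitlines())

def poll(host, community, oid, logfile, interval_s=10, rounds=None):
    """Run snmpwalk every interval_s seconds and append the output to logfile.

    rounds=None polls until interrupted; an integer limits the number of polls.
    """
    cmd = build_snmpwalk_cmd(host, community, oid)
    done = 0
    while rounds is None or done < rounds:
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True,
                                  timeout=interval_s)
            output = proc.stdout if proc.returncode == 0 else proc.stderr
        except (subprocess.TimeoutExpired, OSError) as exc:
            # Record the failure instead of stopping the poller.
            output = f"ERROR {exc}"
        with open(logfile, "a") as fh:
            fh.write(format_record(datetime.now(), output))
        done += 1
        time.sleep(interval_s)
```

A call such as poll("192.0.2.10", "public", PER_CORE_CPU_OID, "cpu_cores.log", interval_s=10) would then build up a timestamped history, where PER_CORE_CPU_OID stands for whichever per-core OID you pick. A short interval matters here, since short spikes disappear in coarse polling.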

Some usable links:

How to query utilization of individual CPU cores via SNMP 

How to configure SNMP on Gaia OS 

Best Practices - SNMP 

Kind regards,
Jozko Mrkvicka
Michel_B
Participant

Thanks for the help, guys. I'm going to look into monitoring the individual cores. We have monitoring in place for general CPU usage, but upon further inspection that doesn't prove to be very useful. Also, the main issue is spikes. Generally speaking, the performance is "okay" (as in, CPU could be at 90% but users are not yet affected) until something happens, like a download or a sudden jump in the number of connections. Then the whole thing comes crashing down in flames. The monitoring hardly ever picked up on these spikes. I'll look into the OIDs.
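One possible reason the spikes go unnoticed, assuming the monitoring reports averages over a polling interval, is that a short burst barely moves a long average. The numbers below are purely hypothetical and only illustrate the arithmetic.

```python
def window_averages(samples, window):
    """Average consecutive groups of `window` samples, as a coarse poller would."""
    averages = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        averages.append(sum(chunk) / len(chunk))
    return averages

# Hypothetical data: 5 minutes of one-second CPU readings at a 70% baseline
# with a single 20-second burst at 100%.
samples = [70] * 140 + [100] * 20 + [70] * 140

print(max(samples))                        # peak at 1-second resolution
print(window_averages(samples, 300))       # one 5-minute average
print(max(window_averages(samples, 10)))   # worst 10-second average
```

With these made-up readings the 5-minute average comes out at 72%, while the 10-second windows still show the full 100% burst.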

For now, I have switched back to R77.30. This helps, although we're still very close to the limits of the gateway.

Back to the drawing board, and waiting until someone spends the money on "bigger boxes". :)

Kaspars_Zibarts
Employee

Yep, been through this cycle a few times :) Not the strongest point for CP: HW usage increases with every release. But there are always two sides to the coin, and you do get a lot of new features.

JozkoMrkvicka
Mentor

If you know the exact time of the spike, you can use SmartLog to find out what was running high at that time. SmartLog has some nice statistics available :)

Another way is cpview history (if you have that enabled), but I don't know how to check stats from the past with it. Maybe using SQL?

Kind regards,
Jozko Mrkvicka
Aidan_Luby
Collaborator

You can simply use the command cpview -t to go to the oldest cpview moment in history and scroll through time per minute with the + and - keys on your keyboard. For more specificity you can also use the syntax cpview -t 03 Sept 2019 09:05 for example to search for an exact time and then still scroll through the timeline with the + and - keys.
