Explorer

Bad performance of 23800 firewall

Hello Experts,

The 23800 firewall is not delivering the expected performance. It tops out at about 7 Gbps, whereas the datasheet specifies 20 Gbps of throughput with the Firewall + IPS blades enabled.

We have taken 4x10 Gbps links, created a bond interface, and configured multiple sub-interfaces on the bond. When traffic flows from one sub-interface to another, the maximum throughput is 7 Gbps.

Can you please help us solve this issue?

Output of fw ctl affinity -l:

Mgmt: CPU 0
Sync: CPU 1
eth1-01: CPU 2
eth1-02: CPU 3
eth1-03: CPU 24
eth1-04: CPU 25
Kernel fw_0: CPU 47
Kernel fw_1: CPU 23
Kernel fw_2: CPU 46
Kernel fw_3: CPU 22
Kernel fw_4: CPU 45
Kernel fw_5: CPU 21
Kernel fw_6: CPU 44
Kernel fw_7: CPU 20
Kernel fw_8: CPU 43
Kernel fw_9: CPU 19
Kernel fw_10: CPU 42
Kernel fw_11: CPU 18
Kernel fw_12: CPU 41
Kernel fw_13: CPU 17
Kernel fw_14: CPU 40
Kernel fw_15: CPU 16
Kernel fw_16: CPU 39
Kernel fw_17: CPU 15
Kernel fw_18: CPU 38
Kernel fw_19: CPU 14
Kernel fw_20: CPU 37
Kernel fw_21: CPU 13
Kernel fw_22: CPU 36
Kernel fw_23: CPU 12
Kernel fw_24: CPU 35
Kernel fw_25: CPU 11
Kernel fw_26: CPU 34
Kernel fw_27: CPU 10
Kernel fw_28: CPU 33
Kernel fw_29: CPU 9
Kernel fw_30: CPU 32
Kernel fw_31: CPU 8
Kernel fw_32: CPU 31
Kernel fw_33: CPU 7
Kernel fw_34: CPU 30
Kernel fw_35: CPU 6
Kernel fw_36: CPU 29
Kernel fw_37: CPU 5
Kernel fw_38: CPU 28
Kernel fw_39: CPU 4
Daemon in.asessiond: CPU 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
Daemon wsdnsd: CPU 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
Daemon topod: CPU 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
Daemon fwd: CPU 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
Daemon in.acapd: CPU 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
Daemon lpd: CPU 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
Daemon mpdaemon: CPU 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
Daemon cpd: CPU 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
Daemon cprid: CPU 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

8 Replies

Admin

What precise version/JHF level?
What precisely are you doing to test performance?
What does the test traffic look like?

If it’s a single flow (one source, one destination), you’re basically creating an elephant flow.
All of our performance tests involve multiple flows.
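To rule out the single-flow limit, a multi-stream iperf3 run is a quick check. A minimal sketch, assuming iperf3 is installed on hosts on both sides of the gateway; the hostname iperf-server is a placeholder:

```shell
# Start the listener on the far side of the firewall:
iperf3 -s

# From the near side, open 8 parallel TCP streams for 30 seconds.
# Each -P stream is a separate connection, so CoreXL can balance
# them across multiple fw_worker instances:
iperf3 -c iperf-server -P 8 -t 30

# Compare against a single stream to see the per-flow ceiling:
iperf3 -c iperf-server -P 1 -t 30
```

If the aggregate with `-P 8` is much higher than the single-stream result, the bottleneck is per-flow handling, not the box as a whole.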

Explorer

Hello Experts,

The 23800 firewall is running R80.20 with Jumbo Hotfix Take 173.

For testing we are using iperf3, but our real requirement is to copy data from one Isilon server to another.

We have observed that with a single session we get around 5 Gbps for the data transfer and 2 Gbps for other traffic, but as soon as we increase the number of sessions, the data-transfer rate drops to 2.5 Gbps.

 

Pawan Shukla 

 

 


1) I would use R80.30 or R80.40 with the latest JHF.

2) Enable Multi-Queue on the 10 Gbit/s interfaces. Without MQ, a throughput of only 3-5 Gbit/s is possible on a single interface. More reading here: R80.x - Performance Tuning Tip - Multi Queue
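As a rough sketch of the MQ check/enable flow on R80.20 (using the cpmq tool; the exact syntax can differ by version and a reboot is required, so verify against the Performance Tuning guide first):

```shell
# Show current Multi-Queue status per supported interface:
cpmq get -a

# Enable Multi-Queue on the desired interfaces (interactive
# prompt on R80.20; a reboot is required to take effect):
cpmq set

# After the reboot, confirm the queues are spread across SND cores:
cpmq get -v
```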

3) Which paths do the packets use (more here: R80.x - Security Gateway Architecture (Logical Packet Flow))? Show us the output of the following commands:

# fwaccel stats -s

# top     -> press 1 for all cores

4) You use 4 SNDs (eth1-01, eth1-02, eth1-03, eth1-04) and 40 CoreXL instances.

   -> If the acceleration path is heavily used, you should use more SNDs.

 

Explorer

Hello,

 

Kindly find the details below and please suggest a solution. In my architecture I have used 4x10 Gb fiber interfaces.

# fwaccel stats -s

Accelerated conns/Total conns : 7026/39238 (17%)
Accelerated pkts/Total pkts : 34463719650/35911020948 (95%)
F2Fed pkts/Total pkts : 1447301298/35911020948 (4%)
F2V pkts/Total pkts : 98202672/35911020948 (0%)
CPASXL pkts/Total pkts : 0/35911020948 (0%)
PSLXL pkts/Total pkts : 23921876122/35911020948 (66%)
CPAS inline pkts/Total pkts : 0/35911020948 (0%)
PSL inline pkts/Total pkts : 0/35911020948 (0%)
QOS inbound pkts/Total pkts : 0/35911020948 (0%)
QOS outbound pkts/Total pkts : 0/35911020948 (0%)
Corrected pkts/Total pkts : 0/35911020948 (0%)
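Reading those counters: only 17% of connections are fully accelerated, and about two-thirds of all packets take the PSLXL (medium) path, i.e. they still hit the fw_worker cores. A quick back-of-the-envelope check with the numbers copied from the output above:

```python
# Packet counters copied from the `fwaccel stats -s` output above.
total_pkts = 35_911_020_948
pslxl_pkts = 23_921_876_122   # medium path (PSLXL)
f2f_pkts   = 1_447_301_298    # slow path (F2F)

# Share of traffic that still needs the CoreXL worker cores:
pslxl_share = pslxl_pkts / total_pkts
f2f_share   = f2f_pkts / total_pkts

print(f"PSLXL (medium path): {pslxl_share:.1%}")  # -> PSLXL (medium path): 66.6%
print(f"F2F (slow path): {f2f_share:.1%}")        # -> F2F (slow path): 4.0%
```

So the 95% "accelerated packets" figure is misleading on its own: most of that traffic is only template-accelerated and still lands on the worker cores via PSLXL.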

#top

 

top - 22:28:34 up 3 days, 2:29, 1 user, load average: 2.17, 2.81, 2.76
Tasks: 615 total, 2 running, 613 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 0.6%sy, 0.0%ni, 93.4%id, 0.0%wa, 0.0%hi, 5.9%si, 0.0%st
Mem: 65746320k total, 13848148k used, 51898172k free, 329668k buffers
Swap: 33551672k total, 0k used, 33551672k free, 1946612k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5095 admin 15 0 0 0 0 S 22 0.0 60:18.99 fw_worker_24
5077 admin 15 0 0 0 0 S 19 0.0 85:41.38 fw_worker_6
5108 admin 15 0 0 0 0 S 19 0.0 220:31.43 fw_worker_37
5092 admin 15 0 0 0 0 S 18 0.0 279:32.86 fw_worker_21
5107 admin 15 0 0 0 0 S 16 0.0 364:14.04 fw_worker_36
8856 admin 15 0 862m 184m 39m S 14 0.3 235:20.76 fw_full
5078 admin 15 0 0 0 0 S 10 0.0 326:44.01 fw_worker_7
5072 admin 15 0 0 0 0 S 8 0.0 256:41.55 fw_worker_1
5110 admin 15 0 0 0 0 S 8 0.0 105:03.55 fw_worker_39
5096 admin 15 0 0 0 0 S 7 0.0 284:17.59 fw_worker_25
5098 admin 15 0 0 0 0 S 6 0.0 261:08.10 fw_worker_27
5103 admin 15 0 0 0 0 S 6 0.0 198:56.70 fw_worker_32
5086 admin 15 0 0 0 0 S 6 0.0 370:28.17 fw_worker_15
5088 admin 15 0 0 0 0 S 6 0.0 401:48.28 fw_worker_17
5080 admin 15 0 0 0 0 S 5 0.0 140:41.30 fw_worker_9
5090 admin 16 0 0 0 0 S 5 0.0 245:51.60 fw_worker_19
5094 admin 15 0 0 0 0 S 5 0.0 199:11.30 fw_worker_23
5074 admin 15 0 0 0 0 S 5 0.0 260:30.67 fw_worker_3
5081 admin 15 0 0 0 0 S 4 0.0 59:37.76 fw_worker_10
5082 admin 15 0 0 0 0 S 4 0.0 138:25.71 fw_worker_11
5083 admin 15 0 0 0 0 S 4 0.0 148:08.63 fw_worker_12
5084 admin 15 0 0 0 0 S 4 0.0 149:37.41 fw_worker_13
5076 admin 15 0 0 0 0 S 4 0.0 241:35.83 fw_worker_5
5099 admin 15 0 0 0 0 S 4 0.0 51:30.35 fw_worker_28
5100 admin 15 0 0 0 0 S 4 0.0 133:52.08 fw_worker_29
5105 admin 15 0 0 0 0 S 4 0.0 205:21.42 fw_worker_34
5085 admin 15 0 0 0 0 S 3 0.0 151:12.51 fw_worker_14
5097 admin 15 0 0 0 0 S 3 0.0 52:58.49 fw_worker_26
5075 admin 15 0 0 0 0 S 3 0.0 63:02.47 fw_worker_4
5091 admin 15 0 0 0 0 S 3 0.0 135:33.92 fw_worker_20
5104 admin 15 0 0 0 0 S 3 0.0 84:22.19 fw_worker_33
5106 admin 15 0 0 0 0 S 3 0.0 225:49.28 fw_worker_35
5071 admin 15 0 0 0 0 S 2 0.0 147:40.42 fw_worker_0
5073 admin 15 0 0 0 0 S 2 0.0 146:38.35 fw_worker_2
5079 admin 15 0 0 0 0 S 2 0.0 142:34.09 fw_worker_8
5089 admin 15 0 0 0 0 S 2 0.0 53:42.59 fw_worker_18
5093 admin 15 0 0 0 0 S 2 0.0 215:51.19 fw_worker_22

 

 

# cpmq get -a

Active ixgbe interfaces:
eth1-01 [Off]
eth1-02 [Off]
eth1-03 [Off]
eth1-04 [Off]

Non-Active ixgbe interfaces:
eth3-01 [Off]
eth3-02 [Off]
eth4-01 [Off]
eth4-02 [Off]
eth4-03 [Off]
eth4-04 [Off]

Active igb interfaces:
Mgmt [Off]
Sync [Off]

Non-Active igb interfaces:
eth2-01 [Off]
eth2-02 [Off]
eth2-03 [Off]
eth2-04 [Off]
eth2-05 [Off]
eth2-06 [Off]
eth2-07 [Off]
eth2-08 [Off]

Pawan Shukla

Admin

For a single connection, 5Gbps is rather good, considering you are using 10G interfaces in the first place. 

You can increase the connection's throughput by applying a FAST_ACCEL bypass for the specific IP addresses on both sides, as described in sk156672.

As @PhoneBoy said already, this is a case of a heavy connection, a.k.a. an "elephant flow", which cannot be addressed by balancing FW inspection across multiple cores. The only solution is to use the fast_accel feature for the IP addresses involved. Mind, you will not reach wire speed, but throughput should be better than through the Medium Path.
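A sketch of what that could look like, assuming the two Isilon nodes are 10.1.1.10 and 10.2.2.10 (placeholder addresses; the exact fast_accel syntax varies by version and JHF take, so follow sk156672 for yours):

```shell
# Enable the SecureXL fast_accel feature:
fw ctl fast_accel enable

# Bypass deep inspection for traffic between the two hosts
# (source, destination, destination port, IP protocol):
fw ctl fast_accel add 10.1.1.10 10.2.2.10 any any

# Verify the rule was installed:
fw ctl fast_accel show_table
```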

Authority

As @HeikoAnkenbrand suggested - you must enable MQ on those interfaces:

eth1-01: CPU 2
eth1-02: CPU 3
eth1-03: CPU 24
eth1-04: CPU 25

Without it you won't get more than 2-3Gbps per interface.

Plus, upgrade: R80.20 was not such a great release. If you want the best interface performance, go with R80.40, as you get the 3.10 kernel, which has much newer drivers for Multi-Queue.

We are in the process of upgrading our 23800 for exactly that reason: to eliminate RX-DRPs on interfaces when traffic is very "bursty". Traffic can jump from 10 Gbps to 20 Gbps suddenly, so we see a lot of RX buffer overflows.


Hi @Pawan_Shukla,

I agree with @Kaspars_Zibarts.

Here's what I would do in this case:

First enable MQ:
By default, each network interface has one traffic queue handled by one CPU. You cannot use more CPU cores for acceleration than the number of interfaces handling traffic. Multi-Queue lets you configure more than one traffic queue for each network interface. For each interface, more than one CPU core is used for acceleration. Multi-Queue is relevant only if SecureXL is enabled.

More read here:
- R80.x - Performance Tuning Tip - Multi Queue
- Performance Tuning R80.30 Administration Guide – Multi-Queue

Enable MQ on these interfaces:

eth1-01: CPU 2
eth1-02: CPU 3
eth1-03: CPU 24
eth1-04: CPU 25

Second:

Use "Fast Accel Rules" for internal networks so that less traffic takes the PSLXL path, or apply IPS only to traffic to and from the internet on the external interface.

More read here:
- R80.x - Performance Tuning Tip - SecureXL Fast Accelerator (fw ctl fast_accel)
- sk156672 - SecureXL Fast Accelerator (fw fast_accel) for R80.20 and above

Third:

Check RX errors:
# netstat -in

RX-ERR: Should be zero. Caused by a cabling problem, electrical interference, or a bad port. Examples: framing errors, short frames/runts, late collisions caused by duplex mismatch.
       Tip: a duplex mismatch is the first and easiest thing to check.

RX-OVR: Should be zero. Overrun in NIC hardware buffering. Solved by using a higher-speed NIC, bonding multiple interfaces, or enabling Ethernet Flow Control (controversial).

       Tip: Use higher-speed NICs or bond interfaces.

RX-DRP: Should be less than 0.1% of RX-OK.  Caused by a network ring buffer overflow in the Gaia kernel due to the inability of SoftIRQ to empty the ring buffer fast enough.  Solved by allocating more SND/IRQ cores in CoreXL (always the first step), enabling Multi-Queue, or as a last resort increasing the ring buffer size.
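The 0.1% guideline is easy to script against the RX-OK and RX-DRP columns of `netstat -in`. A small sketch with made-up sample counters (substitute your own):

```python
def rx_drp_within_guideline(rx_ok: int, rx_drp: int) -> bool:
    """True if drops stay below 0.1% of successfully received packets."""
    if rx_ok == 0:
        return rx_drp == 0
    return rx_drp / rx_ok < 0.001

# Sample counters (illustrative, not from the gateway above):
print(rx_drp_within_guideline(2_000_000_000, 150_000))    # -> True  (0.0075%)
print(rx_drp_within_guideline(2_000_000_000, 9_000_000))  # -> False (0.45%)
```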


More tips read here:
- R80.x Architecture and Performance Tuning - Link Collection  

Champion

Please provide output of Super Seven commands:

Super Seven Performance Assessment Commands (s7pac)

Depending on your policy, if a large percentage of the iperf traffic is accelerated, you are only using six of the available 48 CPUs. Very likely you have RX-DRPs, as the Super Seven output will show, and that is what is slowing you down. You almost certainly need to adjust your CoreXL split and turn on Multi-Queue.

Gaia 3.10 Immersion Self-paced Video Series
now available at http://www.maxpowerfirewalls.com