Solved: Multiple cores for medium path traffic

Marc_Zacho1 · ‎2017-11-10

Hi,

I'm running some throughput tests on a vSEC gateway in network mode (basically just a VM with Gaia installed, afaik) in an NSX/ESXi environment.

The test uses a basic setup: one gateway and two Ubuntu VMs acting as client and server. To measure throughput I'm using iperf (TCP, basic settings).

The problem appears when I enable both IPS and Application Awareness. With both blades enabled I only get around 1.5 Gbps. With just one of the blades it's around 5 Gbps, and without any blades (except FW) it's 6 Gbps, which seems to be a driver limit (that is with e1000; VMXNET3 only reaches 4.5 Gbps).

I have tried playing around with the core allocation, but without luck. There is no difference whether the fw workers have a dedicated core or are able to use all available cores.

According to fwaccel stats -s, more than 90% of the traffic hits PXL.

So my question is: is it possible to split the IPS and APP awareness processing across different CPUs, or just load-share the PXL part even more?

PhoneBoy · ‎2017-11-10

Not sure how that would be possible since IPS and Application Control use the same infrastructure behind the scenes.

In any case, what version of code are we talking here, since you didn't mention that in the post?

What is the configuration of the vSEC instance (number of vCores, RAM allocated, etc.)?

Marc_Zacho1 · ‎2017-11-10

Hi Phoneboy,

Sorry, I forgot to mention that. It's the R77.30 release, build 060 (basic OVF version).

I have tested with different amounts of vCPUs and memory; currently I have 4 vCPUs and 4096 MB RAM.

Current CPU allocation (I know the allocation is messy, but I tried to isolate fw_3 and make it share CPU2 and CPU3):

fw ctl affinity -l -v -r
CPU 0: eth0 (irq 75) eth4 (irq 83) eth1 (irq 99) eth2 (irq 115) eth3 (irq 123) eth6 (irq 91) eth5 (irq 107)
fw_0
mpdaemon vpnd rad fwd cprid cpd
CPU 1: fw_1 fw_2
CPU 2: fw_3
CPU 3: fw_3
All:

Kaspars_Zibarts · ‎2017-11-10

Hi, the CPU allocation does not look great at all. You are mixing SXL (accelerated) traffic on CPU0 with fwk. Not a great idea, at least not on a traditional firewall. I haven't worked much with vSEC so I won't say too much, but try separating the interfaces from the fwk instances and allocate more cores if you can. CP is all about CPU after all.

Out of curiosity, do you have CPU stats from when it maxes out? All flat out? Check whether dynamic core allocation is enabled in your version of R77.30. If you can, add two more CPUs to your VM and then use 0 and 1 as generic cores and 2-5 as statically allocated cores for fwk0-3.

Marc_Zacho1 · ‎2017-11-11

I know the allocation is crap for a production firewall, but this is only used for testing purposes.

I could assign more cores, but I doubt it would help me, because it's only fw_worker_3 which is using the CPU.

Here is top during an iperf run. It's random whether CPU2 or CPU3 is used, but it's always only one of them, so it does some kind of sharing. Still, it seems like one session can bring it to the ground if it matches all blades:

[Expert@mazcptest01:0]# top
top - 09:50:32 up 4 days, 53 min, 2 users, load average: 1.00, 0.52, 0.22
Tasks: 129 total, 5 running, 124 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.3%sy, 0.0%ni, 76.4%id, 0.0%wa, 1.3%hi, 21.9%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi,100.0%si, 0.0%st
Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 3907592k total, 2878280k used, 1029312k free, 219768k buffers
Swap: 2128604k total, 0k used, 2128604k free, 1398696k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4495 admin 18 0 0 0 0 R 99 0.0 10:38.32 fw_worker_3

Timothy_Hall · ‎2017-11-11

> I could assign more cores, but I doubt it would help me, because it's only fw_worker_3 which is using the CPU.

Load the latest GA jumbo HFA onto the gateway, then turn on the Dynamic Dispatcher (fw ctl multik set_mode 9), reboot the system and try your test again. The Dynamic Dispatcher is enabled by default on R80.10+ gateways but off by default in R77.30.

--
My book "Max Power: Check Point Firewall Performance Optimization"
now available via http://maxpowerfirewalls.com.

New Book: "Max Power 2026" Coming Soon
Check Point Firewall Performance Optimization

Marc_Zacho1 · ‎2017-11-11

Hi Tim,

I did try the Dynamic Dispatcher in both mode 9 and mode 4, with no big difference at all, at least not with a single iperf stream.

But to be sure, I just installed the latest HFA:

Was:
Check_Point_R77_30_JUMBO_HF_1_Bundle_T225_FULL.tgz
Now:
Check_Point_R77_30_JUMBO_HF_1_Bundle_T292_FULL.tgz

But it's still the same: CPU at 100% and around 1.5 Gbps throughput. If I use multiple streams, though, I utilise more CPUs now! (Not sure whether that was the case with take 225, but it might have been.)

Now I'm able to get around 2.6 Gbps in total, and I guess it would be higher with more CPUs/workers.

Do you guys think it's possible to get more than 1.5 Gbps "per core" with IPS+APP enabled?

Btw, I already have your book by my side. Great work, I learnt a lot from it!

Hugo_vd_Kooij · ‎2017-11-11

I think 1.5 Gbps per core WITH those blades is more than I would expect.

<< We make miracles happen while you wait. The impossible jobs take just a wee bit longer. >>

Timothy_Hall · ‎2017-11-11

With a single iperf stream like that, all packets for that stream's connection must be processed on the same Firewall Worker core regardless of whether the Dynamic Dispatcher is on or not. Letting the packets get handled by multiple workers would raise the specter of out-of-order delivery, which is a complete disaster from a TCP performance perspective.
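To make the point about connection stickiness concrete, here is a toy model in Python. It is not Check Point's implementation: the `ToyDispatcher` class, the least-loaded assignment policy and the packet counts are all assumptions made for illustration. The only property taken from the explanation above is that every packet of a connection stays on the worker that received the connection's first packet.

```python
class ToyDispatcher:
    """Toy model: a connection is pinned to one worker for its lifetime."""

    def __init__(self, num_workers):
        self.load = [0] * num_workers  # packets handled per worker
        self.conn_table = {}           # 5-tuple -> worker index

    def dispatch(self, five_tuple):
        # A new connection goes to the currently least-loaded worker
        # (assumed policy). Every later packet of that connection follows
        # it there, which keeps one TCP stream's packets in order.
        if five_tuple not in self.conn_table:
            self.conn_table[five_tuple] = self.load.index(min(self.load))
        worker = self.conn_table[five_tuple]
        self.load[worker] += 1
        return worker


def simulate(num_workers, num_streams, packets_per_stream=1000):
    """Interleave packets from num_streams flows; return per-worker load."""
    dispatcher = ToyDispatcher(num_workers)
    flows = [("192.168.2.3", 35116 + 2 * i, "192.168.1.3", 8080, "tcp")
             for i in range(num_streams)]
    for _ in range(packets_per_stream):
        for flow in flows:
            dispatcher.dispatch(flow)
    return dispatcher.load


single = simulate(num_workers=4, num_streams=1)
multi = simulate(num_workers=4, num_streams=12)
print("1 stream  :", single)
print("12 streams:", multi)
```

In this idealized model a single stream loads exactly one worker no matter how many workers exist, while 12 streams spread evenly. A real gateway will distribute connections less evenly than this.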

Try an IPS profile of Default_Protection (called "Optimized" in R80+), which may help, but the presence of APCL guarantees all that traffic will go Medium path. Also make sure you do not have an explicit Accept cleanup rule at the bottom of the APCL policy, and avoid using "Any" in the Source/Dest of any APCL rule; use explicit network objects in the Source and "Internet" in the Destination. That's about it.

Marc_Zacho1 · ‎2017-11-11

The IPS and APP profiles are already set up as described (which you also mention in Max Power).

But I managed to get above 5 Gbps (6 Gbps is the limit with the e1000 driver) with 8 cores: 6 fw workers (one core each) and 1 core for each interface used. iperf was also set to use 6 parallel streams.

I got what I needed. 1.5 Gbps per core is OK, and it seems that the Dynamic Dispatcher does its job OK. It's a bit random how well the connections are shared between the cores, but if I use e.g. 12 streams, all fw workers are doing their job.
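As a rough back-of-the-envelope check on those figures, the aggregate ceiling for many parallel streams can be sketched as the smaller of "workers times per-worker rate" and the NIC driver limit. The function name and the assumption of linear scaling across workers are mine; the 1.5 Gbps and 6 Gbps inputs are the numbers measured in this thread.

```python
def throughput_ceiling_gbps(workers, per_worker_gbps, nic_limit_gbps):
    """Rough upper bound, assuming throughput scales linearly per worker."""
    return min(workers * per_worker_gbps, nic_limit_gbps)


# ~1.5 Gbps per worker with IPS + APCL, ~6 Gbps limit with the e1000 driver.
one_worker = throughput_ceiling_gbps(1, 1.5, 6.0)
six_workers = throughput_ceiling_gbps(6, 1.5, 6.0)
print(one_worker, six_workers)
```

With six workers the workers themselves could in principle handle 9 Gbps, so under these assumptions the driver limit becomes the bottleneck, not the blades.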

Iperf output:

root@smokeping01:~# iperf -c 192.168.1.3 -p 8080 -t 20 -P 12
------------------------------------------------------------
Client connecting to 192.168.1.3, TCP port 8080
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 14] local 192.168.2.3 port 35138 connected with 192.168.1.3 port 8080
[ 5] local 192.168.2.3 port 35118 connected with 192.168.1.3 port 8080
[ 4] local 192.168.2.3 port 35120 connected with 192.168.1.3 port 8080
[ 6] local 192.168.2.3 port 35122 connected with 192.168.1.3 port 8080
[ 7] local 192.168.2.3 port 35124 connected with 192.168.1.3 port 8080
[ 8] local 192.168.2.3 port 35126 connected with 192.168.1.3 port 8080
[ 10] local 192.168.2.3 port 35128 connected with 192.168.1.3 port 8080
[ 11] local 192.168.2.3 port 35130 connected with 192.168.1.3 port 8080
[ 9] local 192.168.2.3 port 35132 connected with 192.168.1.3 port 8080
[ 13] local 192.168.2.3 port 35136 connected with 192.168.1.3 port 8080
[ 12] local 192.168.2.3 port 35134 connected with 192.168.1.3 port 8080
[ 3] local 192.168.2.3 port 35116 connected with 192.168.1.3 port 8080
[ ID] Interval Transfer Bandwidth
[ 5] 0.0-20.0 sec 1.03 GBytes 441 Mbits/sec
[ 6] 0.0-20.0 sec 1.22 GBytes 525 Mbits/sec
[ 3] 0.0-20.0 sec 1.16 GBytes 500 Mbits/sec
[ 14] 0.0-20.0 sec 722 MBytes 303 Mbits/sec
[ 4] 0.0-20.0 sec 1.12 GBytes 480 Mbits/sec
[ 8] 0.0-20.0 sec 1.16 GBytes 499 Mbits/sec
[ 10] 0.0-20.0 sec 1.12 GBytes 482 Mbits/sec
[ 13] 0.0-20.0 sec 780 MBytes 327 Mbits/sec
[ 12] 0.0-20.0 sec 785 MBytes 329 Mbits/sec
[ 7] 0.0-20.0 sec 1.11 GBytes 476 Mbits/sec
[ 11] 0.0-20.0 sec 1.16 GBytes 497 Mbits/sec
[ 9] 0.0-20.0 sec 988 MBytes 414 Mbits/sec
[SUM] 0.0-20.0 sec 12.3 GBytes 5.27 Gbits/sec
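The per-stream figures can be cross-checked against the [SUM] line with a few lines of Python. This is only a sketch: the regular expression is written for the exact line layout pasted above, not for iperf output in general, and the helper name `per_stream_mbps` is made up for this example.

```python
import re

# Result lines copied from the iperf run above.
IPERF_OUTPUT = """\
[ 5] 0.0-20.0 sec 1.03 GBytes 441 Mbits/sec
[ 6] 0.0-20.0 sec 1.22 GBytes 525 Mbits/sec
[ 3] 0.0-20.0 sec 1.16 GBytes 500 Mbits/sec
[ 14] 0.0-20.0 sec 722 MBytes 303 Mbits/sec
[ 4] 0.0-20.0 sec 1.12 GBytes 480 Mbits/sec
[ 8] 0.0-20.0 sec 1.16 GBytes 499 Mbits/sec
[ 10] 0.0-20.0 sec 1.12 GBytes 482 Mbits/sec
[ 13] 0.0-20.0 sec 780 MBytes 327 Mbits/sec
[ 12] 0.0-20.0 sec 785 MBytes 329 Mbits/sec
[ 7] 0.0-20.0 sec 1.11 GBytes 476 Mbits/sec
[ 11] 0.0-20.0 sec 1.16 GBytes 497 Mbits/sec
[ 9] 0.0-20.0 sec 988 MBytes 414 Mbits/sec
[SUM] 0.0-20.0 sec 12.3 GBytes 5.27 Gbits/sec
"""

# Matches "[ <id>] <interval> sec <amount> <unit> <rate> Mbits/sec";
# the [SUM] line has no numeric id and is skipped.
STREAM_LINE = re.compile(
    r"^\[\s*(\d+)\]\s+\S+\s+sec\s+\S+\s+\S+\s+([\d.]+)\s+Mbits/sec$"
)


def per_stream_mbps(text):
    """Return {stream_id: Mbits/sec} for every per-stream result line."""
    rates = {}
    for line in text.splitlines():
        match = STREAM_LINE.match(line.strip())
        if match:
            rates[int(match.group(1))] = float(match.group(2))
    return rates


rates = per_stream_mbps(IPERF_OUTPUT)
total_mbps = sum(rates.values())
print(len(rates), "streams,", total_mbps, "Mbits/sec total")
print("slowest:", min(rates.values()), "fastest:", max(rates.values()))
```

The 12 streams add up to 5273 Mbits/sec, which agrees with the reported [SUM] of 5.27 Gbits/sec. The spread between the slowest stream (303 Mbits/sec) and the fastest (525 Mbits/sec) reflects the uneven sharing between workers mentioned above.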

I guess that's all. Thanks for the help and replies!
