Re: R80.30 - Slow DMZ transfer performance

Howard_Gyton · ‎2019-10-09

Hi,

We recently upgraded our firewall cluster from R77.30 to R80.30.

Almost immediately after this, we noticed a degradation in the performance of a couple of our VMs, one being an SSH gateway, and the other being an ownCloud server. Both have NFS mounts to our Nexenta storage.

Running some tests, we found extremely slow transfer rates to and from each VM's local disk. I was in the process of building a new ownCloud server, and have ended up using this as an analogue for further testing.

Taking a ~250MB file on my local disk and SCPing it up to my test VM, with the VM in the same VLAN as my workstation, the file transfers up and down in about 3 seconds flat.

When I then move this VM into a sub interface (DMZ) on our firewall and run the same tests, it has taken as long as 2 minutes 20 seconds to perform the same transfer.
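To put those timings in perspective, here is a quick back-of-the-envelope conversion into throughput. This is only a minimal Python sketch: it assumes the "~250MB" file is exactly 250 MB and takes the 3 second and 2 minute 20 second figures at face value. The function name is purely illustrative.

```python
def throughput_mb_per_s(size_mb, seconds):
    """Average transfer rate in MB/s for size_mb megabytes moved in seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_mb / seconds


FILE_MB = 250.0        # assumption: the "~250MB" file treated as exactly 250 MB
SAME_VLAN_S = 3.0      # "about 3 seconds flat"
DMZ_S = 2 * 60 + 20    # "as long as 2 minutes 20 seconds" = 140 s

same_vlan = throughput_mb_per_s(FILE_MB, SAME_VLAN_S)
dmz = throughput_mb_per_s(FILE_MB, DMZ_S)
slowdown = DMZ_S / SAME_VLAN_S

print(f"same VLAN: {same_vlan:.1f} MB/s")
print(f"via DMZ:   {dmz:.1f} MB/s")
print(f"slowdown:  {slowdown:.0f}x")
```

Under those assumptions that is roughly 83 MB/s in the same VLAN against under 2 MB/s through the DMZ interface, a slowdown of about 47x in the worst run.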

I even created a new Threat Prevention policy with none of the inspection blades enabled and a rule matching this host, and managed to get variable times ranging from just under 2 minutes down to around 1 minute 18 seconds.

We also do not use QoS.

I have just raised a ticket with our support partner, but was very interested to know whether any other users have observed this. We had noticed a performance degradation under R77.30, but nowhere near this bad.

Howard

Timothy_Hall · ‎2019-10-09

I assume you are using the UDP protocol with NFS, please see this thread:

https://community.checkpoint.com/t5/Enterprise-Appliances-and-Gaia/Message-seen-on-var-log-messages-...

Attend my 60-minute "Be your Own TAC: Part Deux" Presentation
Exclusively at CPX 2025 Las Vegas Tuesday Feb 25th @ 1:00pm

Howard_Gyton · ‎2019-10-09

We have a number of ports included in the rule that grants NFS access:

TCP-2049

UDP-2049

RPC/100003,100005,100021,150001

TCP-4045

UDP-4045

TCP-4046

It was "fine" before the upgrade, although I do seem to remember raw transfers outside of ownCloud being a little slower than I expected. Right now it can be as much as 20x slower than it was before, and is practically unusable.

We've had some small gains from completely disabling threat prevention for this machine, by creating a new policy in Threat Prevention which has no blades selected, and creating a new rule that uses this policy and a host object referencing the ownCloud server.

My client is not being forcibly disconnected randomly as it was before, but the slow speed persists.

And there is also the fact that it affects SCP as well.

Howard

Timothy_Hall · ‎2019-10-09

Please provide command outputs from the "Super Seven" for further analysis:

https://community.checkpoint.com/t5/General-Topics/Super-Seven-Performance-Assessment-Commands-s7pac...

Howard_Gyton · ‎2019-10-10

After a remote session the other day, we found that disabling acceleration "fixed" the issue, and we continue to investigate.

Timothy_Hall · ‎2019-10-10

The sometimes problematic reordering of UDP packets that I mentioned in my earlier reply is performed by SecureXL, so it would make sense that disabling SecureXL would solve your performance problem. Sure sounds like your issue to me...

Howard_Gyton · ‎2019-10-10

Hmm. That shouldn't affect NFS and SSH/SCP though, as they are TCP as far as I am aware, and these are what are causing us the most problems.

We've been pointed to the following sk and a hotfix is in the works. Hopefully this will make it into a future take and not need to be requested. Hopefully, it also won't be tied to a particular take, as that makes upgrading more trouble than it's worth.

https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut...

The traffic we have observed causing issues is not UDP based, but if UDP traffic is affected and this causes a bottleneck for other traffic then we'll have to wait and see if the hotfix works.

Howard

Howard_Gyton · ‎2019-10-21

UPDATE: A response from Check Point QA indicates this is an IPS issue. The protection in question was:

Microsoft TCP IP Selective Acknowledgement Denial of Service (MS10-009)

If this protection is configured as either Detect (the default) or Prevent, then transfer speeds are reduced by as much as 20 times. Creating a global IPS exception for the affected hosts, for just that protection, allows for full transfer speeds, even with firewall acceleration on.

A fix for this protection is being investigated.

A secondary protection was also mentioned, "Web Servers HTTP POST Denial of Service", but I found this to have no bearing on the issue, and could be safely left configured as "Prevent", the default, without impact on performance.

Howard

Timothy_Hall · ‎2019-10-21

Hmm interesting, the "Microsoft TCP IP Selective Acknowledgement Denial of Service (MS10-009)" protection does have a performance impact rating of Critical which means that all traffic subject to inspection by this protection must be handled in F2F (not accelerated). Sounds like your earlier report that disabling acceleration fixed the issue was a red herring. For future reference one can diagnose issues like this by simply running "ips off" directly on the gateway, waiting 60 seconds, then seeing if performance substantially improves. Obviously don't forget to run "ips on" when testing is complete...

Did TAC use the procedure in the following SK to isolate the signature? sk110737: IPS Analyzer Tool - How to analyze IPS performance efficiently

Howard_Gyton · ‎2019-10-21

I don't know what technique QA/RnD used but the procedure I used to collect relevant log data was as follows:

fw ctl debug 0

fw ctl debug -buf 32000

fwaccel dbg + conn offload

fw ctl kdebug -T -f > kdebug.txt

I also supplied the following file:

$FWDIR/log/ips_profiles_report.csv

With acceleration turned off, that protection has no impact on throughput when enabled. Only when both the protection and acceleration are enabled do you see the drop-off.

Incidentally, something I discovered earlier was that FTP transfers seem unaffected. One of our boxes is an FTP server with external access. FTP transfers are at full speed, but SSH/SCP are affected. This box is currently not in the IPS exception I created.

Howard

JosAndel · ‎2019-11-25

Hi Howard,

We experienced much the same after upgrading from R77.30 to R80.30. Lots of vague complaints from coworkers reporting performance degradation. After about three weeks we managed to solve the issue.

- disabling certain checks in IPS didn't help,

- disabling IPS and Threat Prevention for certain traffic flows didn't help,

- CPU load was only about 10 to 15%, memory load about 10%

- our support partner had no idea and couldn't really help us.

After about 2.5 weeks we found out the troubles seemed to be caused by packet loss. From internet to a loadbalancer we measured about 3% loss. That loadbalancer goes to backend servers, through the firewall again, and about 3% loss as well.
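If a request has to cross both of those lossy hops, the losses stack up. A rough Python sketch of the arithmetic, assuming the two 3% figures are independent per-packet loss probabilities (the measurements above do not establish that); the function name is illustrative only.

```python
def combined_loss(*hop_loss_rates):
    """Probability that a packet is lost somewhere along a chain of hops,
    assuming each hop drops packets independently of the others."""
    delivered = 1.0
    for rate in hop_loss_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("loss rates must be between 0 and 1")
        delivered *= 1.0 - rate
    return 1.0 - delivered


# Two traversals of the firewall, each measured at about 3% loss.
end_to_end = combined_loss(0.03, 0.03)
print(f"end-to-end loss: {end_to_end:.2%}")
```

Under that independence assumption, just under 6% of packets would be lost end to end, which is more than enough to hurt TCP throughput noticeably.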

We are running our cluster on VMware. With use of tcpdump we could prove that:

- packets were always arriving at the CheckPoint VM,

- packets were always leaving the CheckPoint VM (according to tcpdump and fw monitor),

- but packets were not always leaving the hardware where the VM is running.
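The reasoning in that list amounts to comparing what was seen at successive capture points. A small Python sketch of the idea, assuming some per-packet identifier (for example IP ID plus TCP sequence number) has already been extracted from each capture. The capture point names and the sample IDs are made up for illustration and are not taken from any real trace.

```python
def loss_between_points(captures):
    """captures is an ordered list of (name, packet_ids) along the path.
    For each consecutive pair, count packets seen upstream that never
    appear downstream."""
    results = []
    for (up_name, up_ids), (down_name, down_ids) in zip(captures, captures[1:]):
        missing = set(up_ids) - set(down_ids)
        results.append((up_name, down_name, len(missing)))
    return results


# Hypothetical packet IDs observed at each point along the path.
captures = [
    ("arriving at VM", {1, 2, 3, 4, 5, 6}),
    ("leaving VM", {1, 2, 3, 4, 5, 6}),
    ("leaving host hardware", {1, 2, 4, 5}),
]

for up, down, lost in loss_between_points(captures):
    print(f"{up} -> {down}: {lost} missing")
```

The first pair with a non-zero count is where to look. In the made-up data above, nothing goes missing inside the VM and the loss only appears between the VM and the host hardware.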

So it seemed that packets were lost in the layer between the CheckPoint VM and the hardware. Looking for best practices for running CheckPoint on VMware, we found this document:

https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut...

Our VM was configured with E1000 NICs, which seems to be the default on VMware ESXi 6.0. We changed that to VMXNET3 (remove the E1000 NICs, add VMXNET3 NICs) on our standby cluster node. After the reboot all the interface names and IPs were still correct, so the cluster formed in the normal manner. At a quiet moment we did a failover to test the new configuration. Three days later the graphs in our monitoring look much better. We don't experience packet loss any more. An SCP file transfer which ran at 1.3MB/s now runs at almost 11MB/s.

So changing from the E1000 driver to VMXNET3 did the trick for us.

Jos

Kaloyan_Kirchev · ‎2020-04-23

Hmm, interesting topic.

We have a similar issue. It is 100% IPS.

When IPS is disabled, traffic is perfect. When it is enabled, end users' internet traffic is slow as f...

There was a manual update and a rules rewrite when the upgrade from R80.10 to R80.30 was done.

We never worked out what the issue with the GAIA SUSE update procedure was.

Now R80.30 and this...

Timothy_Hall · ‎2020-04-26

If you have a problem you'd like addressed, please start a new thread with the specifics of your situation instead of adding on to an old thread that sounds similar, but may or may not be relevant to your problem.

Jesse · ‎2020-04-27

Try this:

https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut...

Kaloyan_Kirchev · ‎2020-04-29

I just want to give feedback.

CP fw fast accel works perfectly fine.

In my case it did solve the problem 🙂

Thanks to @Jesse
