Re: Problem with CPUSE, SCP and policy push - disable AES-NI?

Kenneth_Greger1 · ‎2019-02-25

Hi CheckMates

I'm writing this post to hear if there is anybody else who has experienced these strange problems that we have been facing for the last 6-9 months. Our vendor has not had any customers with the same problems, and it seems to me that the problem is not very well known to CheckPoint TAC either.

We are seeing three distinct problems on our R80.10 T169 VE platforms.
I'm not sure if they are related, but I'm not ruling it out.

1. Problems using CPUSE to obtain patches.
2. Problems copying (SCP) files between cluster members, and (sometimes) from a random server to a cluster member. This problem became apparent when we had to manually copy the patches to our cluster members because CPUSE was not working.
3. A lot of timeouts when we push policy to a cluster. Normally it only fails on the active member. Sometimes we have to push to the cluster, let it fail on the active member, run clusterXL_admin down on the active member, and then push again, just to get the job done.

I should mention that after applying JHF 169 the problem has become less apparent, but it's still there.

We are running 22 CheckPoint clusters on our VMware platform, and we see the same problems on all of them.
Our hardware is fairly modern (mostly Dell PowerEdge R730xd), running an up-to-date version of ESXi (VMware ESXi 6.7.0).

Our WAN consists of a high-speed MPLS network with latency below 50ms, and long-haul IPSec connections with 250-300ms latency.

The firewall clusters that are connected to the MPLS network fail just as often as those on slow IPSec connections.

When we try to use CPUSE (via WebUI) we typically get an error saying: "The package failed to download at [date]. Reason of failure: Does not match Expected SHA1"
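
When a package is copied by hand instead, the same kind of digest check that CPUSE reports on can be repeated manually. The following is a minimal sketch, assuming the standard sha1sum utility is available on the machine; the function name verify_sha1 is hypothetical, and the expected digest has to come from a source you trust.

```shell
# Compare a file's SHA-1 digest with an expected value.
# Usage: verify_sha1 <file> <expected_sha1_hex>
# verify_sha1 is a hypothetical helper name, not a Check Point tool.
verify_sha1() {
    local file="$1" expected="$2" actual
    # sha1sum prints "<digest>  <filename>"; keep only the digest.
    # If the file cannot be read, actual stays empty and the check fails.
    actual=$(sha1sum "$file" 2>/dev/null | awk '{print $1}')
    # Lowercase the expected value so the comparison ignores hex case.
    expected=$(printf '%s' "$expected" | tr 'A-F' 'a-f')
    if [ -n "$actual" ] && [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got '$actual', expected '$expected')"
        return 1
    fi
}
```

Running it several times against the same copied file on both the source and the destination should show whether the file itself is damaged or whether the digest calculation is what varies.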

When we try to SCP a file from one cluster member to another, we get one of two error messages:

# scp big.file adminscp@x.x.x.x:.
big.file 0% 0 0.0KB/s --:-- ETA
Received disconnect from x.x.x.x: 2: Corrupted MAC on input.
lost connection

Or we get this error message:

Write failed: Broken pipe
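
Because the failures are intermittent, it can help to put a number on them before and after a change such as masking AES-NI. This is a minimal sketch of a generic retry counter; count_failures is a hypothetical helper name, and it only looks at the exit status of whatever command it is given.

```shell
# Run a command repeatedly and report how many attempts failed.
# Usage: count_failures <attempts> <command> [args...]
# count_failures is a hypothetical helper name.
count_failures() {
    local attempts="$1" failed=0 i=0
    shift
    while [ "$i" -lt "$attempts" ]; do
        # Discard the command's output; only its exit status is counted.
        if ! "$@" >/dev/null 2>&1; then
            failed=$((failed + 1))
        fi
        i=$((i + 1))
    done
    echo "$failed/$attempts attempts failed"
}
```

For the copy shown above, a call could look like: count_failures 20 scp big.file adminscp@x.x.x.x:. (this assumes key-based authentication, so that no password prompt blocks the loop).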

We have been doing some extensive troubleshooting based on input from both vendor and CheckPoint TAC.
We might have solved the problem by disabling AES-NI in VMware (at the guest level). At least we have not seen this problem on the two clusters where we have made the change. We first tried to disable AES-NI within the CheckPoint installation itself, but that didn't have any effect.

The solution was to disable/hide the AES-NI extension by using CPUID masking on the guest itself.
And yes, AES-NI is supported on our hardware platform.

All our firewalls are clean-installed R80.10 gateways, and even the SmartCenter was rebuilt from scratch.

We did not experience these problems on R77.30 (and below), running on the same virtual/hardware.

I'm no expert, but disabling AES-NI does not feel good at all, and it seems that we are masking a more fundamental problem in the platform. CheckPoint TAC has asked us to raise a ticket with Dell, Intel and VMware, but I have a feeling that it would bounce right back in my face...

So, my question again: has anybody else seen this? If yes, were you able to solve it in a meaningful way?

/Kenneth

Maarten_Sjouw · ‎2019-02-25

Kenneth,

Could there be any problem with MTU size? Did you check if there is any fragmentation on any of the links?

We have a lot of customers and a lot of different types of traffic: MPLS and VPN, but also small internet connections with PPPoE (with an 8-byte header).

We see a lot of need to do MSS clamping, but for your situation you could try to make a small change to the MTU of the Management system and lower it to as far as 1400. In a DMVPN situation you might even need to bring it down further.
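
The arithmetic behind this suggestion can be sketched as follows, assuming plain IPv4 without IP or TCP options and the Linux iputils ping (whose -M do option sets the don't-fragment bit). The function names are hypothetical.

```shell
# ICMP echo payload that exactly fills a given IPv4 MTU:
# MTU minus 20 bytes IPv4 header minus 8 bytes ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# TCP MSS that fits a given IPv4 MTU:
# MTU minus 20 bytes IPv4 header minus 20 bytes TCP header (no options).
tcp_mss_for_mtu() {
    echo $(( $1 - 40 ))
}

# Send three pings sized for <mtu> with fragmentation prohibited.
# Usage: probe_mtu <host> <mtu>
# A path with a smaller MTU should answer with an error or drop the packets.
probe_mtu() {
    ping -c 3 -M do -s "$(icmp_payload_for_mtu "$2")" "$1"
}
```

For an MTU of 1400 this gives a ping payload of 1372 bytes and an MSS of 1360. Probing the path at 1500 and then at 1400 shows whether something in between is fragmenting or dropping full-size packets.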

I know a lot of people will shout "don't do that", but the very small loss in performance can, on the other hand, bring you a gain that you would never expect.

Regards, Maarten

Kenneth_Greger1 · ‎2019-02-25

Maarten,

It could maybe solve the issue with policy push, but it would probably not explain why we see these strange errors when we SCP files between the cluster members. Don't you agree?

I will give the MTU suggestion a go anyway, to see if it actually resolves the policy push issue.

Maarten_Sjouw · ‎2019-02-26

Kenneth, indeed, between the cluster members it should not be a problem, unless the interfaces are 10Gb interfaces running jumbo frames and somewhere under the hood there is an issue with those. It is just one of those things.

Regards, Maarten

B_P · ‎2019-03-26

We had (have) the exact same issue. We had older hardware that would work and newer hardware that wouldn't. It turned out that it depended on whether the hardware supported AES-NI. Once we disabled it, the problems we had (the same as you described) went away. But that comes at a performance cost to our PTP VPN...

Their recommendation to get with VMware doesn't make sense because currently there is no way to tell if Check Point is seeing it or not. The dmesg | grep "AES-NI" command in sk110549 doesn't work. So what can VMware even do?
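
One possible alternative check is to look at the CPU flags the guest kernel reports. This is a sketch only: it assumes the kernel lists the feature as "aes" on the flags line of /proc/cpuinfo, which is how current Linux kernels name it but is unverified here for the older Gaia kernel. The function name has_cpu_flag is hypothetical.

```shell
# Report whether a CPU feature flag is listed in a cpuinfo file.
# Usage: has_cpu_flag <flag> [cpuinfo_file]
# Assumption: the kernel exposes AES-NI as the flag "aes"; has_cpu_flag
# is a hypothetical helper name.
has_cpu_flag() {
    local flag="$1" file="${2:-/proc/cpuinfo}"
    # Look only at "flags" lines and match the flag as a whole word,
    # so that "aes" does not match a longer flag name containing it.
    grep -E '^flags[[:space:]]*:' "$file" | grep -qw -- "$flag"
}
```

If has_cpu_flag aes returns success before CPUID masking and failure afterwards, that would at least show what the guest operating system sees, although not what any particular Check Point process actually uses.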

Aleksandr_Nosit · ‎2019-04-11

Hi Kenneth,

I had the same issue as you described. It ended with us changing the servers to the same type but a model with different CPUs, as it turned out that there are some interoperability issues between AES-NI in R80.10 and this specific CPU in the case of VSEC-VE; a normal SG on the same HW worked without the issue. Or, if AES-NI is not a dealbreaker, disable it at the hypervisor level.

/Alec

Marco_Valenti · ‎2019-04-11

It would be nice to hear something from Check Point regarding this issue. I assume the hardware involved was on the Check Point HCL?

Aleksandr_Nosit · ‎2019-04-11

If the GW is deployed on ESXi, the servers don't have to be on the Check Point HCL, as the platform is VMware in this case. The issue I had occurred ONLY if the GW was deployed on top of ESXi and the server had a specific Intel CPU (E5-2643v4, if I remember correctly); the same server model with a different CPU was fine, and the same server was also fine if the GW was deployed on bare metal.

/Alec

Timothy_Hall · ‎2019-04-11

All the problematic functions you described utilize the OpenSSL library in process/user space to access algorithms like AES. I can't recall any situations I've ever encountered that required disabling AES-NI, in fact I am a strong proponent of using hardware architectures that support it. Since you are using R80.10 you must be using the 2.6.18 kernel with its corresponding version of OpenSSL. This OpenSSL version is "old" but is kept up to date by Check Point for security fixes and such. It has got to be some way that OpenSSL or perhaps the 2.6.18 kernel itself is interacting with VMWare that is causing the AES-NI issue since it does not occur on bare metal gateways. I would imagine that gateway kernel-based VPN encryption functions in the SXL and F2F paths are accessing AES-NI directly from the processor instead of via the OpenSSL library, and as such would not be affected.

Would be very interested to see if this problem continues with the new 3.10 kernel that has a presumably updated version of OpenSSL. This will of course require an upgrade to at least code level R80.20 as well. Might be interesting to see if Check Point can provide a fully-updated version of OpenSSL for the 2.6.18 kernel on R80.10, but I doubt that is possible.
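
To exercise the user-space OpenSSL AES path in isolation, without SCP or CPUSE in the way, a simple encrypt/decrypt round trip can be compared against the original file. This is a minimal sketch, assuming the openssl command-line tool is installed; whether this particular path uses AES-NI depends on the OpenSSL build, which is not verified here. The function name and the throwaway passphrase are hypothetical.

```shell
# Encrypt a file with AES-128-CBC, decrypt it again, and compare the
# result with the original.
# Usage: aes_roundtrip_ok <file>
# aes_roundtrip_ok and the passphrase "testpass" are hypothetical.
aes_roundtrip_ok() {
    local src="$1" enc dec rc=0
    enc=$(mktemp) && dec=$(mktemp) || return 2
    # Any failing step, or a byte-level difference, marks the run as failed.
    openssl enc -aes-128-cbc -k testpass -in "$src" -out "$enc" 2>/dev/null &&
    openssl enc -d -aes-128-cbc -k testpass -in "$enc" -out "$dec" 2>/dev/null &&
    cmp -s "$src" "$dec" || rc=1
    rm -f "$enc" "$dec"
    return $rc
}
```

Repeating the round trip many times on a large file, with and without AES-NI masked in the guest, might show whether the corruption can be reproduced locally.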

