Hi CheckMates
I'm writing this post to hear if there is anybody else who has experienced these strange problems that we have been facing for the last 6-9 months. Our vendor has not had any customers with the same problems, and it seems to me that the problem is not very well known to CheckPoint TAC either.
We are seeing three different problems on our R80.10 T169 VE platforms.
I'm not sure if they are related, but I'm not ruling it out.
- Problems using CPUSE to obtain patches
- Problems copying files with SCP between cluster members, and sometimes from an arbitrary server to a cluster member (this problem became apparent when we had to copy the patches to our cluster members manually because CPUSE was not working)
- Frequent timeouts when we push policy to a cluster. Normally the push fails only on the active member.
Sometimes the only way to get the job done is to push to the cluster, let it fail on the active member, run clusterXL_admin down on that member, and then push again.
I should mention that after applying JHF 169 the problem has become less apparent, but it is still there.
We are running 22 CheckPoint clusters on our VMware platform, and we see the same problems on all of them.
Our hardware is fairly modern (mostly Dell PowerEdge R730xd), running an up-to-date version of ESXi (VMware ESXi 6.7.0).
Our WAN consists of a high-speed MPLS network with latency below 50 ms, and long-haul IPSec connections with 250-300 ms latency.
The firewall clusters connected to the MPLS network fail just as often as those on the slow IPSec connections.
When we try to use CPUSE (via WebUI) we typically get an error saying: "The package failed to download at [date]. Reason of failure: Does not match Expected SHA1"
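Since the CPUSE error is a checksum mismatch, one way to see whether a manually copied package arrived intact is to hash it on the gateway and compare the result with the published value. This is only a rough sketch: `verify_sha1` is a hypothetical helper name, and it assumes `sha1sum` and `awk` are available in expert mode.

```shell
# Hypothetical helper: compare a file's SHA-1 against an expected value.
# Usage: verify_sha1 <file> <expected-sha1-hex>
verify_sha1() {
    file=$1
    # Normalise the expected hash to lower case so the comparison is case-insensitive
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    # sha1sum prints "<hash>  <filename>"; keep only the hash
    actual=$(sha1sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
        return 0
    fi
    echo "MISMATCH: $file (expected $expected, got $actual)"
    return 1
}
```

Running it on both the source machine and the cluster member would show whether the file was damaged in transit or was already wrong at the source.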
When we try to SCP a file from one cluster member to another, we get one of two error messages:
# scp big.file adminscp@x.x.x.x:.
big.file 0% 0 0.0KB/s --:-- ETA
Received disconnect from x.x.x.x: 2: Corrupted MAC on input.
lost connection
Or we get this error message:
Write failed: Broken pipe
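Because the failures are intermittent, it may help to put a number on them, for example to compare a cluster before and after a change. The sketch below runs any command a given number of times and counts the failed runs. `count_failures` is a hypothetical helper name, and using it with scp assumes key-based authentication so that nothing prompts for a password.

```shell
# Hypothetical helper: run a command N times and print how many runs exited non-zero.
# Usage: count_failures <runs> <command> [args...]
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the command's output; only its exit status matters here
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}
```

For example, `count_failures 20 scp big.file adminscp@x.x.x.x:.` would print how many of 20 transfers failed.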
We have been doing some extensive troubleshooting based on input from both vendor and CheckPoint TAC.
We might have solved the problem by disabling AES-NI in VMware (at guest level). At least we have not seen the problem on the two clusters where we have made the change. We first tried to disable AES-NI within the CheckPoint installation itself, but that did not have any effect.
The solution was to disable/hide the AES-NI extension by using CPUID masking on the guest itself.
And yes, AES-NI is supported on our hardware platform.
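To confirm from inside a guest whether the CPUID mask has taken effect, one option is to look for the `aes` flag in `/proc/cpuinfo`. This is a minimal sketch assuming an x86 Linux guest; `has_aes_flag` is a hypothetical helper name, not a CheckPoint or VMware tool.

```shell
# Hypothetical helper: succeed if the CPU flags in a cpuinfo-style file include "aes".
# Usage: has_aes_flag [path]   (defaults to /proc/cpuinfo)
has_aes_flag() {
    cpuinfo=${1:-/proc/cpuinfo}
    # Take the first "flags" line and look for "aes" as a whole word,
    # so that other flags containing those letters (e.g. "vaes") do not match
    grep -m1 '^flags' "$cpuinfo" | grep -qw aes
}

if has_aes_flag; then
    echo "AES-NI is visible to this guest"
else
    echo "AES-NI is not visible to this guest"
fi
```

With the mask applied, the second message would be expected on the guest even though the physical host supports AES-NI.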
All our firewalls are clean-installed R80.10 gateways, and even the SmartCenter was rebuilt from scratch.
We did not experience these problems on R77.30 (and below), running on the same virtual/hardware.
I'm no expert, but disabling AES-NI does not feel good at all, and it seems that we are masking a more fundamental problem in the platform. CheckPoint TAC has asked us to raise a ticket with Dell, Intel and VMware, but I have a feeling that it would bounce right back in my face...
So, my question again: has anybody else seen this? If yes, were you able to solve it in a meaningful way?
/Kenneth