stefan_am
Contributor

Terrible performance IPHTTPS - Checkpoint SG 15600

Long time lurker, first time poster.

I work at an organization that has deployed DirectAccess as its remote access solution, and time and time again we've had people complain about the terrible performance. I'm head of our digital infrastructure unit.

I get that some performance impact is expected when deploying another vendor's VPN solution, but as it stands we're looking at replacing our 15600 clusters with off-the-rack consumer products.

Our ISP delivers dual 10Gbit connections to our 10Gbit firewalls; the deployment is the same in both of our datacenters.

The 4 DirectAccess servers with 10Gbit interfaces going through this solution have reported an aggregated peak throughput of 461 Mbit/s with 1,300 users connected since the start of the pandemic.

On average we're seeing somewhere around 400 Mbit/s aggregated throughput for about 1,200 users.

After a year of troubleshooting we finally decided to completely remove the 15600 and see whether DirectAccess (the IPsec tunnels) was the root cause, but we immediately bottlenecked the 10Gbit interfaces on the servers, with users getting up to 890 Mbit/s throughput.

During normal operations a single user can sometimes peak their connection through our 15600 at 40-60 Mbit/s, but on average sees about 0.3-0.5 Mbit/s throughput.

Is this expected behavior from this security appliance? Going through the 15600 we're seeing a 99.97-99.98% performance drop in our network. I tell my network team that this can't be working as expected, but they claim both our support organization and Check Point TAC say there is nothing wrong with our setup. I haven't personally spoken to a Check Point rep, since our support agreement only gives me access to the "experts" at our support organization.
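For what it's worth, the percentages line up with the aggregate figures. A quick sketch of the arithmetic (the 1 Gbit/s client line speed is my assumption; only the 400 Mbit/s and 1,200-user figures come from above):

```python
# Sketch of the arithmetic behind the figures above.
# Assumption: clients have ~1 Gbit/s line speed (not stated explicitly).
users = 1200
aggregate_mbit = 400          # average aggregated throughput through the 15600

per_user_mbit = aggregate_mbit / users
print(f"avg per-user: {per_user_mbit:.2f} Mbit/s")   # ~0.33, matching 0.3-0.5

line_speed_mbit = 1000        # assumed client line speed
drop = 1 - per_user_mbit / line_speed_mbit
print(f"drop vs line speed: {drop:.2%}")             # ~99.97%
```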

If the performance really is this terrible in this regard, I honestly can't understand why anyone would consider Check Point as a vendor when you're getting 3 cents on the dollar in a best-case scenario.

Do I just turn the page and go with another vendor for our next investment, or is there an actual explanation for why our setup is struggling this badly?

49 Replies
Vladimir
Champion

What IP do you have in the NAT properties of the DA server object itself?

Is the NAT rule you are describing a manual one, or automatically created?

Could the IPs 123.10.20.110 and the network 123.10.20.0 be remnants of the old ISP range?

Last one (for this post 🙂): are there any load balancers in the picture that are not mentioned?

 

stefan_am
Contributor

Thank you very much for taking the time 🙂 I really don't mind a million questions if it resolves this!

123.10.20.110 is a manually created object which is added to a manual NAT rule. The actual server, 10.10.20.110, is the target destination in that rule, and that object does not have any NAT properties at all. I just typed 123.10.20.110 as an example; our actual address is something else, but it's an address from the B-range that we own (123.10.0.0/16). We do have a bunch of different rules for other services on 123.10.21.0, 123.10.22.0 and so on, but all of those are defined subnets, and I can find bonds that contain those subnets when looking at the interfaces.

Vladimir
Champion

I am puzzled then as to how the inbound traffic is getting to that server at all, if its actual IP is in the RFC 1918 range, it does not have a proper public IP assigned to it, and at the same time no accurate manual (static) NAT rule is defined for it.

Or is it actually defined, but the one you are describing is simply one more rule referencing the same internal IP?

If the last statement is true, is the correct manual rule higher or lower than the one we are discussing?

stefan_am
Contributor

Getting kinda excited that we're finding illogical things, almost like we're getting closer to _a_ problem 🙂

When I look in the object explorer we have a network object called N123.10.20.0_24 which contains that subnet, and then a bunch of host objects called H10.10.20.10X and so on whose actual IP addresses are in 123.10.20.0; there are about 15 of them. Two of them are the DirectAccess servers in this DC.

I can't find any references to this subnet, or to any hosts in it, on the cluster or the nodes apart from our security policy and NAT policy.

Would it be possible that there is a static route outside of the firewall, at the ISP, that defines this subnet with our firewall's external IP address as its gateway?

The NAT rule seems to be working fine. I just read it again and it says "ANY SOURCE -> TARGET H10.10.20.110X (which, as I mentioned, is 123.10.20.110) -> Original source -> Translated target H10.10.20.110".

PhoneBoy
Admin

Assuming your gateway is receiving traffic destined for these IPs, then it must be the result of an upstream route.
I don’t believe that’s uncommon.

stefan_am
Contributor

Okay, thanks. But since we have no route defined for the subnet, does that cause the traffic to be routed to our default gateway? Would that happen before or after the rules are evaluated?

Vladimir
Champion

If you are using ISP's routers at the boundary and not your own, this may be the case.

Generally, if you have such a large public range, I would not expect the entire /16 to be forwarded to the public IP of the cluster.

That way you can split the network on the routers and terminate different smaller networks on different public-facing devices.

Please let me know if your cluster's public side has a mask of /16.

stefan_am
Contributor

We do not have a /16 defined; we do, however, have about 4 different /24 subnets in each DC on the public side of each cluster.

Vladimir
Champion

If that is the case, then your ISP is forwarding the suspect 123.10.20.0/24 to the public IP of your cluster in question (or, at least, it should be).

In this case, I would expect to see proxy arp entries for each host that has static NAT in that range.

I would also expect to see the reverse NAT rule under the one you have described for return traffic.

If you can, run a traceroute to that IP and see if it even terminates on the correct cluster, or if it is bouncing off another one.

stefan_am
Contributor

There are no reverse NAT rules on these hosts.

[Expert@dc1-edge1:0]# traceroute 123.10.20.110
traceroute to 123.10.20.110 (123.10.20.110), 30 hops max, 40 byte packets
1 123.10.19.252 (123.10.19.252) 2.256 ms 2.373 ms 2.396 ms
2 * * *
3 * * *
.....
29 * * *
30 * * *

There is no route defined to this network on the cluster. 123.10.19.252 is our ISP's address on the router where they hand off our internet access.

Vladimir
Champion

Apologies, I should have clarified: traceroute to that IP from outside. Any remote host would do.

If you are using proxy ARP, the cluster can accept or originate traffic even if there are no routes for this network, so long as your ISP is forwarding the /24 this IP belongs to to the public IP of the cluster.
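As a toy model of that mechanism (nothing Check Point-specific; the MAC address and the second IP are made up for illustration):

```python
# Toy model of proxy ARP on the shared external segment: the ISP router ARPs
# for the NATed public IP, and the cluster answers with its own MAC because a
# proxy-ARP entry exists for that IP -- so frames reach the cluster even though
# its routing table has no entry for 123.10.20.0/24.
CLUSTER_MAC = "00:1c:7f:aa:bb:cc"                   # hypothetical
proxy_arp_entries = {"123.10.20.110": CLUSTER_MAC}  # one entry per static NAT

def who_answers_arp(dst_ip):
    """Return the MAC that claims dst_ip on this segment, or None."""
    return proxy_arp_entries.get(dst_ip)

assert who_answers_arp("123.10.20.110") == CLUSTER_MAC  # frames get forwarded
assert who_answers_arp("123.10.20.99") is None          # no entry, no answer
```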

But:

unless you have a compelling reason for NATing that server to 123.10.20.110 (e.g., you may have DNS records for it), I would suggest defining its NAT properties using an unused public IP from the /24 that your cluster is using, and disabling the manual NAT rule in question (a new automatic NAT rule pair will be created once the public IP is defined in the properties of the object).

Creating two new NAT bypass rules:

Internal or DMZ to 10.10.20.110(DA-server) original original

10.10.20.110(DA-server) to Internal or DMZ original original

...and checking how things work then.

stefan_am
Contributor

Thank you for your suggestion; I won't be changing anything on a Friday night and ruin my weekend, though 🙂

When running a trace from home I can see the ISP router before the packets start to drop; we only allow HTTPS through on that IP at the moment. We do have a DNS record for this, but there really is no compelling reason to use that specific address, and it can easily be changed. To answer a previous question: we did have a load balancer in this deployment previously, but it has been removed for now to take complexity out of the troubleshooting.

I will get back to this on Monday with my team. Thank you so much for your help.

Vladimir
Champion

Welcome and good luck for Monday's attempt!

If you are logging ICMP, you may see the drops on the firewall from that traceroute attempt.

You may also temporarily create a specific rule allowing it from your home public IP.

If my suggestion works and if you are to re-introduce the load balancer, some changes may be required.

Have a nice weekend,

Vladimir

stefan_am
Contributor

We put up a brand new DirectAccess server beside our old deployment and created an object with NAT on it, then added the NAT bypass rules for internal traffic as per your suggestion, but we still get the same throughput from the internet.

We have 6 active interfaces in 3 bonds: bond1 (Internet), bond2 (DMZ), bond3 (LAN). When connecting a DA client on a VLAN on bond2 and bond3 we get the client's line speed on all tests. The only difference I can see is the NAT when coming from bond1, but another "funny" thing is that on upload we get about 3 times the performance on a VPN client compared to what we get when downloading.

Vladimir
Champion

@stefan_am Thank you for the update.

In order to exclude all possible factors that may be affecting this traffic, let's check the following:

1. Are all three bonded interfaces connected either to the same L2 fabric partitioned into three isolated VLANs, or to identical switches deployed in an identical manner?

2. Please describe how the load balancing on each bond is configured (i.e. LACP or XOR and if L2/L3-L4 is used).

3. Check the error count on the external interfaces to see if any errors are present and, if so, whether the count is growing.

4. If there are no errors, or the count is stable, test from a DA client connected to the external VLAN to take the path from a normal client via the Internet and your ISP out of the equation.

5. Is HTTPS inspection enabled? If yes, is the traffic in question in a bypass rule?

6. Are the firewalls configured to work in IPv4 only or IPv4 and IPv6?

7. Are there any rules in the policy specifically targeting IPv6 tunneling?

stefan_am
Contributor

Hi!

1. All 3 bonds are connected with both NICs to physically different devices. Bond1 connects to dual Cisco 3850s, and bond2 and bond3 connect to different Nexus switches.

2. I take it you mean the operation mode: all 3 are configured to use 802.3ad, with the Layer 2 transmit hash policy and slow LACP rate. MTU 1500, 100 ms monitor, 200 ms down and 100 ms up.

3. Checked the error count and it's steady at 0 on the bonds.

4. I don't really get what you mean, take what out?

5. HTTPS inspection is disabled, but we do have a bypass rule for these IP addresses.

6. Firewalls are configured for IPv4 only.

7. There are no rules apart from "Allow HTTPS" to these servers' NAT addresses.
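One follow-up thought on item 2 (hedged, since I don't know your exact switch-side config): with a Layer 2 transmit hash policy, the outgoing slave is picked from the source/destination MAC pair. If practically all external traffic is exchanged with a single upstream router MAC, every frame can hash onto the same slave, so the bond never uses more than one link in that direction. A simplified sketch of the hash (the real Linux bonding driver also mixes in the EtherType; both MACs are made up):

```python
def layer2_hash(src_mac, dst_mac, n_slaves):
    """Simplified layer2 transmit hash: XOR of the last MAC octets,
    modulo the number of bond slaves."""
    last = lambda mac: int(mac.split(":")[-1], 16)
    return (last(src_mac) ^ last(dst_mac)) % n_slaves

FIREWALL_MAC = "00:1c:7f:aa:bb:cc"    # hypothetical
ISP_ROUTER_MAC = "00:0a:8b:11:22:33"  # hypothetical

# Every external conversation shares the same MAC pair, so a two-link bond
# puts all of that traffic on a single slave:
chosen = {layer2_hash(FIREWALL_MAC, ISP_ROUTER_MAC, 2) for _ in range(1000)}
assert len(chosen) == 1
```

This would not explain dropping below 1 Gbit/s on its own, but it is one reason an external bond can underperform relative to the internal ones.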

Vladimir
Champion

For 4:

If you are presently testing DA client performance from either the internal network or the DMZ itself, you are not traversing multiple hops to get to the DA server. So I am suggesting connecting the DA client to one of the ports on your Cisco 3850s, assigning it one of the available public IPs, and testing the performance from there.

 

If you are comfortable with it, can you kill one of the slave interfaces on the external bond and test it in that state?

stefan_am
Contributor

I'm sorry if I haven't been clear, but we didn't have any other readily available ports to connect from, so we connected our client to a port on the 3850 and that's where we did our external test. We just untagged the VLAN and used an IP address, 123.10.20.50, with the firewall as the default gateway.

We might be able to do these tests with the second interface disabled with a bit of planning, but so far we've only tried it with everything online.

Vladimir
Champion

Thank you for the clarification. Your approach of testing the access from outside is a valid one (provided the client was connected to the 3850 on which the currently active cluster member is terminated).

If the test on a single slave interface does not yield good results (provided we can try it on both interfaces to see if they behave differently), I am afraid that would exhaust all the ideas I have to date as to the cause of the issue.

I am almost sure that it is a TCP window scaling issue at its root. To either prove or disprove this theory, you may want to perform a packet capture of a session from the external zone.
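To put rough numbers on the window-scaling theory (a sketch; the RTT values are illustrative, not measured on your network): if the window scale option is lost end to end, e.g. stripped by a middlebox, the receive window is capped at the 16-bit maximum and per-flow throughput is bounded by window/RTT regardless of link speed.

```python
MAX_UNSCALED_WINDOW = 65535  # bytes: the 16-bit TCP window field, no scaling

def window_limited_mbit(window_bytes, rtt_seconds):
    """Upper bound on a single flow's throughput when window-limited."""
    return window_bytes * 8 / rtt_seconds / 1e6

for rtt_ms in (5, 30, 100):
    cap = window_limited_mbit(MAX_UNSCALED_WINDOW, rtt_ms / 1000)
    print(f"RTT {rtt_ms:3d} ms -> at most {cap:.1f} Mbit/s per flow")
```

A capture of the three-way handshake from the external side would show whether the window scale option survives in both directions.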

The overall slowness from outside may hint at an MTU size mismatch, but because it is also asymmetrical, I would expect to see something like a duplex mismatch as well. The fact that you are not seeing drops or errors seems to contradict this theory.
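On the MTU side, the encapsulation overhead itself is worth keeping in mind: IP-HTTPS carries IPv6 inside a TLS stream inside TCP inside IPv4, so each layer eats into the 1500-byte path MTU. Back-of-the-envelope only; the TLS record overhead is an assumption and varies with the cipher suite:

```python
PATH_MTU = 1500          # bytes on the external path
IPV4_HDR, OUTER_TCP = 20, 20
TLS_RECORD = 29          # assumed nominal TLS overhead; cipher-suite dependent
IPV6_HDR, INNER_TCP = 40, 20

inner_payload = PATH_MTU - IPV4_HDR - OUTER_TCP - TLS_RECORD - IPV6_HDR - INNER_TCP
print(inner_payload)                                   # usable bytes per inner segment
print(f"overhead: {1 - inner_payload / PATH_MTU:.1%}")
# If the tunnel assumes a larger inner MTU than the path really allows, the
# oversized packets get fragmented or dropped along the way, which can hurt
# one traffic direction far more than the other.
```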

Vladimir
Champion

@stefan_am , did you get a chance to test that last suggestion?
