Solved: CloudGuard, AWS, GWLB - first packet isn't SYN

vinceneil666 · ‎2022-11-22

Hi ! 🙂

I have set up a scale set in AWS using Gateway Load Balancers (GWLB). I pretty much ran the CloudFormation template, did my on-prem management setup (CME), and put on some firewall rules. The solution was working fine: we have lots of traffic moving over the setup, and the load balancers distribute it fine.

Then we got a very strange issue related to SAP traffic. We have several SAP clients connecting to a server, and as long as they are working it's fine, but as soon as they idle for 7 to 9 minutes the connection breaks and they have to re-initiate it.

The firewall logs are full of 'first packet isn't SYN' entries related to this traffic. We spent lots of hours checking and verifying everything along the way. Do note that this was working fine before we introduced the CloudGuard scale set.

After checking the application, routers, other firewalls, and of course the scale set, I am unable to find any error. I can only verify that the packets are hitting the firewall and that they stop flowing when SAP is idling. We also considered the tuple settings for the GWLB and various timeout settings.

The scale set is set up as minimum 2, maximum 2. We changed this to minimum 1 and then killed one of the CloudGuard firewalls, leaving us with a scale set with only one active CloudGuard firewall. The minute we did this, everything started working.

Can anybody point me in the right direction on what to do next? My initial thought is that when I ran cppcap on all nodes to verify that the traffic was entering the correct firewall even after idling, I somehow missed a packet hitting the other firewall, since cppcap will not show dropped packets.

Can this be related to the tuple setting on the GWLB? Or might it be some kind of timeout on the GWLB that moves the session over to the other CloudGuard after 4 minutes or so? Does anyone have any experience with this?

The environment is running R80.40, since that is the only version supporting the GWLB and the GENEVE protocol as of now. (I see that R81.20 was just released, so it might be supported there, but an upgrade as it stands now is out of the question.)

Shay_Levin · ‎2022-11-22

Some applications or API requests, such as synchronous API calls to databases, have long periods of inactivity. GWLB has a fixed idle timeout of 350 seconds for TCP flows and 120 seconds for non-TCP flows. Once the idle timeout is reached for a flow, it is removed from GWLB’s connection state table. As a result, the subsequent packets for that flow are treated as a new flow and may be sent to a different healthy firewall instance. This can result in the flow timing out on the client side. Some firewalls have a default timeout of 3600 seconds (1 hour). In this case, GWLB’s idle timeout is lower than the timeout value on the firewall, which causes GWLB to remove the flow without the firewall or client being aware it was dropped.
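
To illustrate the mechanism described above, here is a minimal toy model in Python. It is a sketch under stated assumptions, not real GWLB or CloudGuard logic: the class names, the flow tuple, and the round-robin target selection are all hypothetical. Round robin stands in for the real flow hashing so that the re-selection after the idle timeout deterministically lands on the other firewall; a real load balancer could happen to pick the same one.

```python
from dataclasses import dataclass, field

GWLB_TCP_IDLE_TIMEOUT = 350  # seconds, the TCP value quoted in the post above


@dataclass
class ToyLoadBalancer:
    """Toy flow table with an idle timeout (hypothetical, not real GWLB code)."""
    targets: list
    idle_timeout: int = GWLB_TCP_IDLE_TIMEOUT
    flows: dict = field(default_factory=dict)  # flow -> (target, last_seen)
    picks: int = 0  # round-robin counter, an assumption replacing the real hash

    def route(self, flow, now):
        entry = self.flows.get(flow)
        if entry is not None and now - entry[1] < self.idle_timeout:
            target = entry[0]  # flow still known: stick to the same target
        else:
            # Flow unknown or expired: treat it as new and pick a target again.
            target = self.targets[self.picks % len(self.targets)]
            self.picks += 1
        self.flows[flow] = (target, now)
        return target


@dataclass
class ToyFirewall:
    """Toy stateful firewall that only opens a session on a SYN."""
    name: str
    sessions: set = field(default_factory=set)

    def inspect(self, flow, is_syn):
        if is_syn:
            self.sessions.add(flow)
            return "accept"
        if flow in self.sessions:
            return "accept"
        return "drop: first packet isn't SYN"


def simulate(firewall_names, idle_gap):
    """Send SYN, data after 100 s, then data after idle_gap more seconds."""
    firewalls = {name: ToyFirewall(name) for name in firewall_names}
    lb = ToyLoadBalancer(targets=list(firewall_names))
    flow = ("10.0.0.5", 40000, "10.0.1.9", 3200, "tcp")  # hypothetical 5-tuple
    verdicts = []
    now = 0
    for is_syn, gap in [(True, 0), (False, 100), (False, idle_gap)]:
        now += gap
        fw = firewalls[lb.route(flow, now)]
        verdicts.append((fw.name, fw.inspect(flow, is_syn)))
    return verdicts


if __name__ == "__main__":
    print(simulate(["fw-a", "fw-b"], idle_gap=480))  # idle for 8 minutes
    print(simulate(["fw-a"], idle_gap=480))          # single firewall
```

In this model, an 8-minute idle gap with two firewalls sends the third packet to the second firewall, which has no session for it and drops it. With a single firewall the expired flow is re-selected onto the same instance, which still holds the session, consistent with the behaviour described in the original question.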

To prevent this from happening, we recommend either configuring the TCP keep-alive setting to less than 350 seconds in the client's or server's application or operating system (OS), or updating your firewall's timeout settings to less than 350 seconds for TCP and less than 120 seconds for non-TCP flows (see figure 1 in the AWS blog post linked below). This ensures that either the client/server keeps the flow alive during inactivity or the firewall removes the session before GWLB does.
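
As a rough sketch of the application-side option, the following Python snippet enables TCP keepalive on a socket with an idle time below 350 seconds. The function name `enable_keepalive` and the default values (300 s idle, 30 s interval, 3 probes) are assumptions chosen for illustration; this is generic socket code, not how SAP configures its own keepalive. The `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` options are platform dependent, so each is applied only if the running platform exposes it.

```python
import socket

GWLB_TCP_IDLE_TIMEOUT = 350  # seconds, the TCP value quoted in the post above


def enable_keepalive(sock, idle=300, interval=30, probes=3):
    """Turn on TCP keepalive with an idle time below the GWLB timeout.

    idle, interval and probes are illustrative defaults, not recommendations.
    """
    if idle >= GWLB_TCP_IDLE_TIMEOUT:
        raise ValueError("idle must be below the load balancer idle timeout")
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options below are not available on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keepalive enabled:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

The options would normally be set before or right after connecting; once set, the OS sends a probe after the configured idle time, which counts as traffic on the flow.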

https://aws.amazon.com/blogs/networking-and-content-delivery/best-practices-for-deploying-gateway-lo...

_Val_ · ‎2022-11-22

@Shay_Levin can you advise?

vinceneil666 · ‎2022-11-22

Thanks - that was the issue. We were able to implement a keepalive feature in the SAP application, so that will keep the session alive past the roughly 6-minute (350-second) idle timeout of the load balancer.

As far as I can see, the old classic load balancer had an option of tuning the timeout - but I guess that is not an option for the gateway load balancers.

abihsot__ · ‎2023-06-13

Isn't it nonsense to have such a low default idle timeout (350 seconds) and then drop the connection silently? Operating systems usually default to a keepalive time of 7200 seconds.
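
For anyone wanting to check this gap on their own hosts, here is a small sketch that reads the system-wide keepalive time on Linux and compares it with the 350-second figure from this thread. The helper names are hypothetical, and the `/proc` path applies to Linux only; on other systems the reader function simply returns `None`. Note that the system default only matters for sockets that have keepalive enabled in the first place.

```python
from pathlib import Path

GWLB_TCP_IDLE_TIMEOUT = 350  # seconds, as quoted earlier in the thread
LINUX_KEEPALIVE_PATH = Path("/proc/sys/net/ipv4/tcp_keepalive_time")  # Linux only


def read_linux_keepalive_time(path=LINUX_KEEPALIVE_PATH):
    """Return the system-wide keepalive idle time in seconds, or None if unavailable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def first_probe_too_late(keepalive_time, idle_timeout=GWLB_TCP_IDLE_TIMEOUT):
    """True if the first keepalive probe would only be sent once the flow has expired."""
    return keepalive_time >= idle_timeout


if __name__ == "__main__":
    current = read_linux_keepalive_time()
    if current is None:
        print("Could not read the keepalive time on this system")
    else:
        print(f"keepalive time: {current} s, too late: {first_probe_too_late(current)}")
```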
