I thought I should post this given all the issues with performance that I read about on the internet, following upgrades from R77.X to R80.X, as it may help some. I should say I’m still waiting for acceptance of the issue and feedback on a fix from Check Point support. Apologies in advance for the length of the post, but it's complicated.
We were initially moving from an old cluster on R77.20 to a new firewall gateway cluster on new hardware running R80.20 (with the new Gaia 3.10 kernel). However, each time we made the R80.20 cluster live, we started getting reports of performance issues for data transfers and random instances of connections being lost (reset), mainly from our 85 remote sites connected over a COIN, each with its own Sophos XG firewall protecting the local site. Each time we’ve had to revert to the old hardware on R77.20. Almost 10 months later, Check Point support, right up to R&D level, has not been able to identify the issue.
The issue, as I initially described it to Check Point, is that we’re seeing an awful lot of ‘first packet isn’t SYN’ messages in the logs and a rapidly increasing cumulative total when you use the cpview command. Closer inspection of packet captures showed significantly more connection resets after moving to R80.20 (now R80.40). Connections are randomly being dropped, and this is what causes the ‘first packet isn’t SYN’ log entries.
Recently I noted that it’s actually the Sophos XG firewalls at our remote sites that are dropping the connections, but oddly only when the R80.X cluster was active, not R77.20. Of course upon me telling Check Point it’s the Sophos XG that’s dropping the connections, they immediately closed the call, with a rather abrupt response.
Subsequently, I believe I've found the issue. The problem appears to be that R80.20 (and newer) incorrectly alters the Selective Acknowledgement (SACK) values in packets, which are used to efficiently recover from packet loss (lost segments). You can see what happens in the packet captures below.
Capture from Internal Interface (Where the server is connected)
Packet Time Source Destination Protocol Length Info
14 2020-07-21 07:52:13.987323 SERVER CLIENT TCP 60 443 → 53954 [ACK] Seq=1896 Ack=2014 Win=35456 Len=0
15 2020-07-21 07:52:14.026565 CLIENT SERVER TCP 1506 [TCP Previous segment not captured] 53954 → 443 [ACK] Seq=3482 Ack=1896 Win=65536 Len=1452 [TCP segment of a reassembled PDU]
16 2020-07-21 07:52:14.027036 CLIENT SERVER TCP 1506 53954 → 443 [ACK] Seq=4934 Ack=1896 Win=65536 Len=1452 [TCP segment of a reassembled PDU]
17 2020-07-21 07:52:14.027050 SERVER CLIENT TCP 66 [TCP Window Update] 443 → 53954 [ACK] Seq=1896 Ack=2014 Win=38272 Len=0 SLE=3482 SRE=4934
Simultaneous Capture from External Interface (Where the client is connected)
Packet Time Source Destination Protocol Length Info
14 2020-07-21 07:52:13.987378 SERVER CLIENT TCP 54 443 → 53954 [ACK] Seq=1896 Ack=2014 Win=35456 Len=0
15 2020-07-21 07:52:14.026552 CLIENT SERVER TCP 1506 [TCP Previous segment not captured] 53954 → 443 [ACK] Seq=3482 Ack=1896 Win=65536 Len=1452 [TCP segment of a reassembled PDU]
16 2020-07-21 07:52:14.027024 CLIENT SERVER TCP 1506 53954 → 443 [ACK] Seq=4934 Ack=1896 Win=65536 Len=1452 [TCP segment of a reassembled PDU]
17 2020-07-21 07:52:14.027060 SERVER CLIENT TCP 66 [TCP Window Update] 443 → 53954 [ACK] Seq=1896 Ack=2014 Win=38272 Len=0 SLE=905734836 SRE=905736288
In packet 17, the SLE (left edge) and SRE (right edge) sequence numbers are altered from when they enter the R80.X firewall to when they leave. The values are not valid sequence numbers (much too high and not previous sequence numbers in this connection) so our Sophos XG firewalls at the remote sites are correctly dropping the packets as invalid.
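For anyone wanting to check their own captures, the comparison above can be sketched in a few lines of Python. This is only an illustration using the numbers from packet 17 in the two captures. It assumes the values are Wireshark relative sequence numbers, and the names `SackBlock`, `block_length` and `is_plausible` are made up for this example rather than taken from any tool.

```python
from typing import NamedTuple

SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


class SackBlock(NamedTuple):
    left: int   # SLE: first sequence number covered by the block
    right: int  # SRE: sequence number just past the end of the block


def block_length(block: SackBlock) -> int:
    """Number of bytes the SACK block claims to acknowledge."""
    return (block.right - block.left) % SEQ_MOD


def is_plausible(block: SackBlock, cumulative_ack: int, highest_seq_sent: int) -> bool:
    """A SACK block should sit above the cumulative ACK and not run past
    the highest sequence number the sender has transmitted so far.
    Assumption: relative sequence numbers, so wraparound is ignored here."""
    return cumulative_ack < block.left < block.right <= highest_seq_sent


# Values from packet 17 in the two captures in this post.
internal = SackBlock(left=3482, right=4934)            # as seen on the internal interface
external = SackBlock(left=905734836, right=905736288)  # as seen on the external interface

cumulative_ack = 2014           # Ack= value in packet 17
highest_seq_sent = 4934 + 1452  # end of the client's packet 16 (Seq + Len)

left_shift = (external.left - internal.left) % SEQ_MOD
right_shift = (external.right - internal.right) % SEQ_MOD

print("internal plausible:", is_plausible(internal, cumulative_ack, highest_seq_sent))
print("external plausible:", is_plausible(external, cumulative_ack, highest_seq_sent))
print("block lengths:", block_length(internal), block_length(external))
print("edge shifts:", left_shift, right_shift)
```

With these numbers the internal block passes the check and the external one fails it. Both edges are shifted by the same amount (905731354) and the block still covers 1452 bytes, so the block looks relocated rather than randomly corrupted. That would be consistent with some kind of offset being applied to the SACK option, although the captures alone don't prove the cause.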
Today I tried going live with R80.40 in our environment again, but with selective acknowledgements turned off in Gaia. Although this obviously isn't the long-term fix, everything is working fine, with no reports of connections being lost (reset) or performance issues! I'm still waiting for Check Point to acknowledge and fix the issue, which has persisted since R80.20 and will be affecting a lot of customers, but at least I have found a workaround in the meantime and hope it helps someone.