johnnyringo
Advisor

R80.40 NAT port exhaustion: why do cluster members show vastly different high port capacity?

Saw a few warnings/errors today on a specific R80.40 gateway regarding NAT pool exhaustion. This showed up before with R80.30, since we have a source NAT hide rule for traffic coming from the Internet to the application.

I went to the gateway, ran cpview, and looked under Advanced -> NAT. The problematic gateway showed a High Port capacity of 66. The other gateway in the cluster showed 16,533, which seems to be the normal value.

I've also confirmed this by walking SNMP OID 1.3.6.1.4.1.2620.1.56.1301.3.1.8: one cluster member shows a value of 16,533, the other shows a number in the mid-60s. I did the same on some other R80.40 gateways and the numbers were always the same.
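When comparing the two members side by side, the raw walk output can be reduced to just the integer values. A minimal sketch, assuming standard net-snmp `snmpwalk` output lines (the sample lines and helper name here are illustrative, not captures from the gateways above):

```python
import re

def high_port_capacities(snmpwalk_output: str) -> list[int]:
    """Extract integer values from snmpwalk lines such as
    'SNMPv2-SMI::enterprises.2620... = INTEGER: 16533'."""
    return [int(m.group(1))
            for m in re.finditer(r"INTEGER:\s*(\d+)", snmpwalk_output)]

# Illustrative output, one line captured per cluster member
member_a = "SNMPv2-SMI::enterprises.2620.1.56.1301.3.1.8.1 = INTEGER: 16533"
member_b = "SNMPv2-SMI::enterprises.2620.1.56.1301.3.1.8.1 = INTEGER: 66"

print(high_port_capacities(member_a))  # [16533]
print(high_port_capacities(member_b))  # [66]
```

Running this against both members' walks makes the 66 vs. 16,533 discrepancy immediately visible in a monitoring script rather than by eyeballing cpview.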

Very confused why this would be. I do understand the port allocations changed in R80.40, but I would certainly expect each member to report the same capacity. I've replicated this in a lab setup and found it's consistent in R80.40, and failing the firewalls over had no effect on the numbers reported.

 

0 Kudos
9 Replies
Chris_Atkinson
Employee

What model gateway is used here out of interest? 

CCSM R77/R80/ELITE
0 Kudos
johnnyringo
Advisor

CloudGuard IaaS (High Availability) on Google Cloud Platform

0 Kudos
Chris_Atkinson
Employee

I presume fewer than 5 cores are assigned to each instance; has GNAT been enabled manually?

CCSM R77/R80/ELITE
0 Kudos
johnnyringo
Advisor

No, everything should just be running a factory-default configuration.  

0 Kudos
johnnyringo
Advisor

Just to be sure, I've verified GNAT is disabled on both members:

[Expert@cp-member-a:0]# modinfo -p $FWDIR/boot/modules/fw_kern*.o | sort -u | awk 'BEGIN {FS=":"} ; {print $1}' | xargs -n 1 fw ctl get int | grep gnat_

enable_cgnat_hairpinning = 0
fwx_cgnat_sync_table = 0
fwx_gnat_enabled = 0


[Expert@cp-member-b:0]# modinfo -p $FWDIR/boot/modules/fw_kern*.o | sort -u | awk 'BEGIN {FS=":"} ; {print $1}' | xargs -n 1 fw ctl get int | grep gnat_

enable_cgnat_hairpinning = 0
fwx_cgnat_sync_table = 0
fwx_gnat_enabled = 0
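The same verification can be scripted so that both members' output is checked programmatically rather than read by eye. A small sketch, assuming the `name = value` format shown above (the function name is my own; the parameter names come from the output itself):

```python
def gnat_disabled(fw_ctl_output: str) -> bool:
    """Return True when every 'name = value' line reports 0,
    i.e. all gnat-related kernel parameters are disabled."""
    for line in fw_ctl_output.strip().splitlines():
        name, _, value = line.partition("=")
        if name.strip() and int(value.strip()) != 0:
            return False
    return True

# Output from the one-liner above, as pasted for member A
output = """\
enable_cgnat_hairpinning = 0
fwx_cgnat_sync_table = 0
fwx_gnat_enabled = 0
"""
print(gnat_disabled(output))  # True
```

Feeding each member's output through a check like this confirms GNAT is off everywhere before ruling it out as the cause.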

 

0 Kudos
the_rock
Legend

0 Kudos
johnnyringo
Advisor

Yeah, I already read that last year after hitting a similar issue in R80.30, where we hit NAT exhaustion at around 1,200 connections (nowhere near the 16,533 capacity). TAC could never explain why. In this case, the question is why the capacity on two different cluster members reports 66 vs. 16,533.

This thread is interesting: R80.40 GNAT issue after Upgrade

But these were fresh R80.40 deployments. Also, I ran the one-liner and verified fwx_gnat_enabled = 0 on both members.

0 Kudos
the_rock
Legend

I saw someone post the link below, which TAC gave them when they had the same issue, but I can't recall what they ended up changing from the SK. Let me see if I can find that post.

https://supportcenter.checkpoint.com/supportcenter/portal?eventSubmit_doGoviewsolutiondetails=&solut...

0 Kudos
johnnyringo
Advisor

Well, TAC just replied. What a surprise, it's a bug. Never saw that coming! 😂

sk177228: /var/log/messages Is Flooded on the Standby Member with the Log 'allocate_port_impl: Could...

Currently the custom Hotfix is only available on Take 120. We're on a mix of Take 125 and 139.

I really wonder if just force-enabling GNAT is the better solution. I still don't understand why it would only be enabled for 5 vCPUs or higher.
