Solved: First impressions R80.30 on gateway - one step for...

Kaspars_Zibarts · ‎2020-01-18

Ok, we were finally "forced" to go ahead and upgrade our gateways from R80.10 to R80.30 for fairly small things - we wanted to be ale to use O365 Updatable Object (instead of home grown scripts) and improve Domain (FQDN) object performance issues when all FWK cores were making DNS queries causing a lot of alerts (see https://community.checkpoint.com/t5/General-Topics/Domain-objects-in-R80-10-spamming-DNS/m-p/19786)

Positive things - upgrades were smooth and painless - both on regular gateways and VSX.

All regular gateways seems to be performing as before, but I have to be honest that they are "over-dimensioned" and having rather powerfull HW for the job - 5900 with 16 cores.

VSX though threw couple of surprises.

SXL medium path usage. CPU jumped from <30% to above 50% on the busiest VS that only has FW and IA blades enabled. Ok, there is also VPN but only one connection:

I haven't spent enough time digging into it but for some reason 1/3 of all connections took medium path whereas before in R80.10 it was nearly all fully accelerated. And most of it was HTTPS (95%) with next most used LDAP-SSL (2%)

I used the SXL fast accelerator feature (thanks @HeikoAnkenbrand https://community.checkpoint.com/t5/General-Topics/R80-x-Performance-Tuning-Tip-SecureXL-Fast-Accele...) to exclude our proxies and some other nets and you can see that on friday CPU load was reduced by 10% but nowhere near what it used to be.

I just find it impossible to explain why would gateway with only FW blade enabled start to to throw all (by the looks of it) traffic via PXL. And statistics are a bit funny too:

FQDN alerts in logs. I can definitely confirm that only one core now is doing DNS lookups (against all DNS server you have defined, in our case 2). But we are still getting a lot of alerts like these: Firewall - Domain resolving error. Check DNS configuration on the gateway (0)

Especially after I enabled updatable object for O365 in the rulebase.

As said before - I have not spent too much time on this as we had other "fun" stuff to deal with on our chassis, so it's fairly "raw". I will report more once I had some answers

Kaspars_Zibarts · ‎2020-01-19

I got a "tip off" from inside CP! Verifying if I'm allowed to publish it here but seems like my PXL issue is resolved! Yeehaa! Power of community! Thanks to @Ilya_Yusupov

And the "secret stuff" here:

Regarding medium path – you see most traffic in medium path due to a known bug we have since R80.20, TLS parser is enabled when the following combinations of blades are enabled

FW + IDA or/and Monitoring or/and VPN (exactly our case!)

You can validate me by running the following command - “fw ctl get int tls_parser_enable” it will bring 1
As WA you can disable it by running the following on the fly - “fw ctl set int tls_parser_enable 0” è for permanent disabled put it under $FWDIR/boot/modules/fwkern.conf tls_parser_enable=0 and reboot.
The above will bring the traffic to be fully accelerated as in previous version.

View solution in original post

Kaspars_Zibarts · ‎2020-01-18

Just had a closer look at the IPs that are being sent to medium path and all points to O365 / MS.

Strange as O365 object is fully removed now from rules and DB.

Kaspars_Zibarts · ‎2020-01-18

Ouch! Penny just dropped, not even sure how I overlooked the fact that CPUSE upgrade changed our hyper-threading from OFF to ON but (!) kept original manual affinity settings. So not surprising that CPU usage was screwed. I.e. our multiqueue was running on 6 "half" cores instead of 6 "full"! etc etc

Something to watch out for if you are using manual affinities on VSX!

HeikoAnkenbrand · ‎2020-01-18

Hi @Kaspars_Zibarts

The Fast Acceleration (picture 1 green) feature lets you define trusted connections to allow bypassing deep packet inspection on R80.20 JHF103 and above gateways. This feature significantly improves throughput for these trusted high volume connections and reduces CPU consumption.

During my tests, I could reduce CPU (Core) usage about 10%-30%. It is also logically, no more content inspection is executed.

I like that you showed that graphically.👍

fast_accel_3 (1).PNG

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips

Timothy_Hall · ‎2020-01-18

Hi Kaspars,

Thanks for your report, a few comments:

1) What do you have set in the Track field of your rules? If using Detailed or Extended logging this can pull traffic into PSLXL to provide the extra detail being requested. Found out about this one while writing the third edition of my book.

2) Do you have any services with Protocol Signature enabled in the Network/Firewall policy if using ordered layers, or in the top level of rules if using inline? This can also cause some of what you are seeing and you should try to stick to simple services (just a port number) in those layers if possible, then call for Protocol Signatures and applications/URLs/content in subsequent layers.

3) As far as that wacky Accelerated Conns percentage, you must have very large amount of stateless traffic, see sk109467: 'Accelerated conns' value is higher than 'Accelerated pkts' in the output of 'fwaccel stat....

4) As you noticed the gateway is much more dependent on speedy DNS starting in R80.20 due to Updatable objects, rad, wsdnsd and a lot of other daemons.

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com

Kaspars_Zibarts · ‎2020-01-19

I got a "tip off" from inside CP! Verifying if I'm allowed to publish it here but seems like my PXL issue is resolved! Yeehaa! Power of community! Thanks to @Ilya_Yusupov

And the "secret stuff" here:

Regarding medium path – you see most traffic in medium path due to a known bug we have since R80.20, TLS parser is enabled when the following combinations of blades are enabled

FW + IDA or/and Monitoring or/and VPN (exactly our case!)

You can validate me by running the following command - “fw ctl get int tls_parser_enable” it will bring 1
As WA you can disable it by running the following on the fly - “fw ctl set int tls_parser_enable 0” è for permanent disabled put it under $FWDIR/boot/modules/fwkern.conf tls_parser_enable=0 and reboot.
The above will bring the traffic to be fully accelerated as in previous version.

Timothy_Hall · ‎2020-01-19

Wow nice one Kaspars, don't think I would have ever figured that one out. Will disabling the TLS parser as shown cause issues with other blades should they get enabled later?

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com

Kaspars_Zibarts · ‎2020-01-20

@Timothy_Hall as far as I understood R&D are working on proper long term solution to fix it.

As for FQDN alerts, now I can confirm that O365 updatable object is definitely causing it but only on our busy VSX. I haven't seen the same issue on regular gateways.

According to CP, alert is issued when resolver cannot get response to checkpoint.com query. I took a tcpdump and confirmed that DNS is actually responding but it does generate wsdnsd log, here's example of packet capture and matching wsdnsd.elg entry:

[wsdnsd 32546]@vsx1-ext[20 Jan 9:10:33] Warning:cp_timed_blocker_handler: A handler [0xf6f213d0] blocked for 44 seconds.
[wsdnsd 32546]@vsx1-ext[20 Jan 9:10:33] Warning:cp_timed_blocker_handler: Handler info: Library [/opt/CPshrd-R80.30/lib/libResolver.so], Function offset [0x2b3d0].
[wsdnsd 32546]@vsx1-ext[20 Jan 9:10:33] Warning:cp_timed_blocker_handler: Handler info: Nearest symbol name [_Z10Sock_InputiPv], offset [0x2b3d0].

Still digging through my packet capture to see if i can find any strange names / responses etc

Ilya_Yusupov · ‎2020-01-20

@Timothy_Hall - Indeed when you will enabled blades that will require tls parser you will need to remove the WA i suggested.

The WA is currently only for the combinations i sent.

Timothy_Hall · ‎2020-01-20

That makes sense, thanks. Will add this workaround to the upcoming R80.40 addendum but be careful to add caveats for which blades are enabled.

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com

Timothy_Hall · ‎2020-08-26

Note that the long-term fix for the TLS parser being inappropriately invoked with certain blade combinations has been fixed in R80.40 Jumbo HFA Take 78+. This fix is also going to be backported into R80.20 and R80.30 Jumbo HFAs as well as mentioned in my R80.40 addendum for Max Power 2020. It is always preferable to have this fix present if possible rather than manually tampering with the TLS parser, as doing so can cause further problems.

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com

rob123 · ‎2020-01-19

👍

Kaspars_Zibarts · ‎2020-01-22

Just a quick update on FQDN object alerts.

All was caused by missing rule that would permit DNS requests using TCP from gateway. I have added full details at the corresponding thread about FQDN here:

https://community.checkpoint.com/t5/General-Management-Topics/Domain-Objects-FQDN-An-Unofficial-ATRG...

Khalid_Aftas · ‎2020-06-14

we upgraded to 80.30 with latest HF, and even with disabling the parser (we dont use other blade) 49% of trafic is going trough medium path, how can we check further ?

Case atm.

Timothy_Hall · ‎2020-06-14

Output of enabled_blades please.

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com

Khalid_Aftas · ‎2020-06-14

FW,Identity Awerness, IPS.

Got info from TAC (canada) that tcp/445 and https will use medium path on 80.30 regardless of PSL parser disabled or not.

Kaspars_Zibarts · ‎2020-06-15

That's correct - file shares (445) is forced via PXL. There's a procedure available to exclude it but it's fairly complex from memory

Khalid_Aftas · ‎2020-06-15

well we used fast_accel to bypass the big flows we had and PXL % droped to 20% now.

That traffic is now overloading SND core .... and fwk are still quiet high....

Khalid_Aftas · ‎2020-10-14

More than 4 months laters, multiple sessions with TAC, and still no solutions yet (one special hotfix that did not help), this seems to be a known issue and TAC is not able to sync with r&d to get a fix ? wondering if someone in the community was able to get a fix at the end.

Ilya_Yusupov · ‎2020-10-14

Hi @Khalid_Aftas ,

the fix exist in R80.30 JHF take 219 and in R80.20 JHF 141.

Thanks,

Ilya

Khalid_Aftas · ‎2020-10-15

Hi Illya, are you referring to The fix from sk166700 is integrated into take 219 of the jumbo for R80.30 as (look for the article or PRJ-14368, PRJ-15747, or PRHF-10818 in sk153152) Because in case 6-0002031094 we had a private fix for those on our current JHF. I would like to avoid doing a change and giving hope to my client for nothing. Thanks a lot

Ilya_Yusupov · ‎2020-10-15

Hi @Khalid_Aftas ,

Yes I'm referring tp fix described in SK166700.

Thanks,

Ilya

Khalid_Aftas · ‎2020-11-09

Hi Illya, after JHF 219 installation, the issue is still the same, i still need to force use fastaccel to accelerate tcp 443/445 otherwise the fwks are saturated.

Ilya_Yusupov · ‎2020-11-09

Hi @Khalid_Aftas ,

You mention port 445 which is SMB protocol not sure it is same issue as i'm sure the TLS parser issue described in the SK solved in JHF 219.

For SMB protocol if you have majority of such traffic the only possible solution i'm aware of is putting in under fast_accel.

Did you tried to put only 445 under fast_accel?

Khalid_Aftas · ‎2020-11-09

I used fast accel with combination of specifics source/destination and smb (the top talkers) and saw a decrease directly.

Using fastaccel for all smb traffic is another story, with this workaround all the security features are useless ...

Is this a known limitation on r80.30 ? that SMB traffic is not accelerated ? because we had 0 issues on 77.30

Ilya_Yusupov · ‎2020-11-09

Hi @Khalid_Aftas ,

As far as I know in R77.30 we had same behavior, I suggest to open TAC case for further investigation.

Thanks,

Ilya

Are you a member of CheckMates?

First impressions R80.30 on gateway - one step forward one (or two back)