Re: VSX appliance upgrade to R80.40 T78 - first impressions

Kaspars_Zibarts · ‎2020-10-23

My usual "morning after" report, in case it might help someone.

We were on R80.30 T215 before the upgrade, running on 23800 appliances that were NOT hyper-threaded.

Good stuff:

Really impressed with the CPUSE CLI upgrade! Especially considering the complexity: kernel upgrade from 2.6 to 3.10, enabling hyper-threading etc. Well done Check Point! I used the Multi-Version Cluster (MVC) Upgrade option and it worked like a charm: connections synchronised in the cluster and I was able to fail over one VS at a time.

Why did we decide against clean install and vsx_util reconfigure?

easier rollback on the gateway: as the file system remains EXT3, you are able to use snapshots created prior to R80.40. With a clean install the file system would change to XFS, therefore snapshot revert would not work
no need to take care of any customisations, i.e.:
- manual CoreXL settings
- non-default IA settings
- scripts
- contents of user folders
- SSH keys and known hosts used by external monitoring

So the actual upgrade was a breeze, I have to admit!

I do not want to celebrate too early but first indications are that our RX-DRP issues might be cured with a better MQ implementation in 3.10

Potential show-stoppers or things you will need to take care of:

1. User crontab is reset, so you will have to add it back manually. For us it's a normal procedure anyway for any upgrade, but be mindful
2. Not 100% sure, but for some reason on one box we saw IA nested groups reset to the default setting of 20. We have it disabled. Just check it if you have customised it from 20. This is to deal with high CPU utilisation by pdpd
3. Interface RX ring buffer settings were defaulted during upgrade. We were forced to increase them in R80.30 due to noticeable RX-DRP presence that affected Teams voice
4. SNMP v3 stopped working after upgrade, leaving us pretty much blind without any graphs to assess R80.40 performance properly. Major problem if you ask me. @Friedrich_Recht 💪 saved my night: here's link to the VSX SNMP v3 workaround, TAC case still open with CP for permanent fix
5. SNMP OID ifDescr (1.3.6.1.2.1.2.2.1.2) has changed from interface name to interface card description. It is actually the "correct" move but it "broke" our monitoring systems, i.e. good old MRTG, as it used ifDescr to fetch the interface index, therefore after upgrade it failed to match interface name to an index:
   - R80.30: iso.3.6.1.2.1.2.2.1.2.2 = STRING: "Mgmt"
   - R80.40: iso.3.6.1.2.1.2.2.1.2.2 = STRING: "Intel Corporation I211 Gigabit Network Connection"
6. MultiQueue manual settings will be replaced with default Auto. Left it at that for now, seems to do a good job
7. Unable to display NAT table (fwx_alloc) on a busy VS. Only cpview works from R80.40 onwards (sk156852). But I'm unable to pull stats using SNMP as described in the SK; only VS0 seems to be supported
8. FQDN domain object issue (added 29/10). Description and workaround available here: FQDN objects allow many unrelated IPs
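
For the ifDescr change above, one possible way to keep a poller matching on interface names is to build the name-to-index map from IF-MIB ifName (1.3.6.1.2.1.31.1.1.1.1) instead of ifDescr. The snippet below is only a sketch: it parses text in the `iso.` format that snmpwalk prints in this thread, the function name `index_by_name` is made up for illustration, and it assumes your R80.40 agent still returns the short interface name in ifName, which is worth confirming with a walk first.

```python
import re

# IF-MIB ifName column, written the way snmpwalk prints it after the "iso." prefix.
IF_NAME_OID = "3.6.1.2.1.31.1.1.1.1"

# Matches lines such as: iso.3.6.1.2.1.31.1.1.1.1.38 = STRING: "bond2"
LINE_RE = re.compile(
    r'^iso\.(?P<oid>[\d.]+)\.(?P<index>\d+) = STRING: "(?P<value>[^"]*)"'
)


def index_by_name(walk_output, column_oid=IF_NAME_OID):
    """Return {interface name: ifIndex} for one column of snmpwalk text output."""
    mapping = {}
    for line in walk_output.splitlines():
        match = LINE_RE.match(line.strip())
        # Skip other columns (e.g. ifDescr) and non-string values.
        if match and match.group("oid") == column_oid:
            mapping[match.group("value")] = int(match.group("index"))
    return mapping


if __name__ == "__main__":
    sample = (
        'iso.3.6.1.2.1.2.2.1.2.38 = STRING: "bond2"\n'
        'iso.3.6.1.2.1.31.1.1.1.1.38 = STRING: "bond2"\n'
    )
    print(index_by_name(sample))
```

The same function can be pointed at ifDescr by passing `column_oid="3.6.1.2.1.2.2.1.2"`, which makes it easy to compare both columns before and after the upgrade.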

LAST WORD: I personally would not recommend deploying R80.40 on VSX with the current take 78 in critical production environments, as there are too many issues with SNMP v3 and you lose service and performance visibility. The exception is if you need to resolve interface performance related issues, i.e. RX buffer overflows that are causing operational problems. I will review and update this when we deploy the next take

Kaspars_Zibarts · ‎2020-10-23

Adding:

8. SNMP OID ifHighSpeed (.1.3.6.1.2.1.31.1.1.1.15) and ifSpeed (.1.3.6.1.2.1.2.2.1.5) are set to zero for bond interfaces. This affects our monitoring system, as it relies on this info to decide whether to read the 32-bit or the 64-bit counter:

for example in R80.40

iso.3.6.1.2.1.2.2.1.2.38 = STRING: "bond2"
iso.3.6.1.2.1.2.2.1.5.38 = Gauge32: 0

iso.3.6.1.2.1.31.1.1.1.1.38 = STRING: "bond2"
iso.3.6.1.2.1.31.1.1.1.15.38 = Gauge32: 0

compared to R80.30:

iso.3.6.1.2.1.31.1.1.1.1.62 = STRING: "bond1"
iso.3.6.1.2.1.31.1.1.1.15.62 = Gauge32: 20000

iso.3.6.1.2.1.2.2.1.2.62 = STRING: "bond1"
iso.3.6.1.2.1.2.2.1.5.62 = Gauge32: 4294967295
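
To illustrate why a zero speed matters, here is a rough sketch of how a poller might pick the counter width from these two OIDs. It is not how any particular monitoring product does it. The function names are invented, the 20 Mbit/s threshold follows my reading of the IF-MIB counter-size guidance (RFC 2863), and falling back to 64-bit counters when no speed is reported is purely an assumption that the agent exposes the HC counters.

```python
GAUGE32_MAX = 4294967295  # ifSpeed saturates here for anything faster than ~4.29 Gbit/s


def speed_mbps(if_speed, if_high_speed):
    """Return the interface speed in Mbit/s, or None if the agent reports nothing usable."""
    if if_high_speed > 0:
        return if_high_speed  # ifHighSpeed is already in units of 1,000,000 bit/s
    if 0 < if_speed < GAUGE32_MAX:
        return if_speed / 1_000_000  # ifSpeed is in bit/s
    return None


def counter_bits(if_speed, if_high_speed, threshold_mbps=20):
    """Choose 32 or 64 bit octet counters for an interface."""
    mbps = speed_mbps(if_speed, if_high_speed)
    if mbps is None:
        # Assumption: with no speed reported (the R80.40 bond case),
        # prefer the 64-bit counters rather than risk fast wrap-around.
        return 64
    return 64 if mbps > threshold_mbps else 32


if __name__ == "__main__":
    print(counter_bits(4294967295, 20000))  # bond1 values from R80.30
    print(counter_bits(0, 0))               # bond2 values from R80.40
```

A poller that instead treats zero as "slow link" would fall back to 32-bit counters, which wrap within seconds on a busy 10Gbps bond.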

Kaspars_Zibarts · ‎2020-10-23

And for those looking for performance improvements with MQ and interface discards, here's a little teaser of the difference between the 2.6 and 3.10 kernels. Same HW, a 10Gbps interface loaded to approx 5Gbps on average, with high short bursts on top
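
If you want to track the same thing without graphs, a minimal sketch is to compute the RX drop percentage per interface from the contents of /proc/net/dev, whose receive "drop" column is, as far as I know, what netstat -ni shows as RX-DRP. The function name and the sample numbers are made up, and the ratio is simply drops divided by received packets, so treat it as a rough indicator only.

```python
def rx_drop_percent(proc_net_dev_text):
    """Return {interface: RX drops as a percentage of RX packets}."""
    result = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in proc_net_dev_text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, counters = line.split(":", 1)
        fields = counters.split()
        if len(fields) < 4:
            continue
        # Receive columns: bytes packets errs drop ...
        rx_packets, rx_drop = int(fields[1]), int(fields[3])
        result[name.strip()] = (rx_drop / rx_packets * 100) if rx_packets else 0.0
    return result


if __name__ == "__main__":
    with open("/proc/net/dev") as handle:
        for iface, percent in sorted(rx_drop_percent(handle.read()).items()):
            print(f"{iface}: {percent:.4f}% RX drops")
```

Since the counters are cumulative since boot, comparing two readings taken some minutes apart gives a better picture than a single snapshot.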

Magnus-Holmberg · ‎2020-10-23

Dammit, I was just about to change to 1.3.6.1.2.1.2.2.1.2 as OP5 seems to f* up the interface names all the time.
They change more or less each time it's rebooted, so description was a way for me to resolve it..

Great work @Kaspars_Zibarts

https://www.youtube.com/c/MagnusHolmberg-NetSec

Kaspars_Zibarts · ‎2020-10-23

ifDescr SNMP OID change is actually documented here: sk168601

Kaspars_Zibarts · ‎2020-10-26

Seems like FWK CPU usage has gone up in R80.40 across all VSes by approx 20%. No change in SXL/F2F split so this is pure CPU increase on FWK

Kaspars_Zibarts · ‎2020-10-29

CPU usage increase was "fixed" after the FQDN object misbehaviour (see point #8) workaround was implemented! Yay

Kaspars_Zibarts · ‎2021-03-03

SMALL UPDATE

We upgraded one more VSX cluster, this time to T91, so I need to highlight three potential issues and one documentation note:

increased CPU usage, approx +10-20%. I originally wrote that the solution is still the same, i.e. DNS passive learning that's enabled by default, and that disabling it will reduce the CPU but you will lose additional functionality, especially for O365 updatable objects for domains with wildcards. That is not the case, pls ignore! DPL is working without CPU impact. A new suspect has arrived, working on details!
captive portal not working, fix is in sk170433
loss of web traffic for approx 5min after cutover. The root cause was that updatable objects, including O365 services, were not properly initialised, so we had to do the manual kick described here: sk121877. As soon as we ran the unified_dl UPDATE ONLINE_SERVICES command on the corresponding VS, the UOs populated and everything started working. We have now added an additional check to the procedure
the upgrade manual is incorrect in the MVC cluster section: it does not tell you to run vsx_util upgrade before running the CPUSE upgrade on the gateways. It is correct for a single VSX gateway though.

Hope it helps someone else!

IdanC · ‎2021-03-15

Regarding crontab - When adding the jobs using Gaia (Web portal - Job Scheduler / Clish command add cron job) they are preserved after CPUSE Upgrade

Kaspars_Zibarts · ‎2021-03-16

Indeed, this is about tasks that are run more often than possible with job scheduler
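
For user crontab entries like these, a small post-upgrade check could compare a copy of `crontab -l` saved before the upgrade with the output after it and list whatever went missing. This is just a sketch; the function names and the example script paths are hypothetical.

```python
def job_lines(crontab_text):
    """Return the non-empty, non-comment lines of a crontab listing."""
    lines = []
    for raw in crontab_text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            lines.append(line)
    return lines


def missing_jobs(saved_text, current_text):
    """Return jobs present in the saved crontab but absent from the current one."""
    present = set(job_lines(current_text))
    return [line for line in job_lines(saved_text) if line not in present]


if __name__ == "__main__":
    # Hypothetical example: output of `crontab -l` before and after the upgrade.
    before = (
        "# monitoring\n"
        "*/1 * * * * /home/admin/scripts/poll.sh\n"
        "0 2 * * * /home/admin/scripts/backup.sh\n"
    )
    after = "0 2 * * * /home/admin/scripts/backup.sh\n"
    for job in missing_jobs(before, after):
        print("missing:", job)
```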
