Issues with throughput after VSX upgrade from R80.40 T102 to T118

Kaspars_Zibarts · ‎2021-06-09

Just wondering if anyone else has seen any weird issues with total throughput being capped at 2Gbps after upgrading to the current T118?

That's on CP appliance 23800

I did not observe any other issues apart from reduced throughput. After restoring the T102 snapshot we were back to normal levels, way above 2Gbps.

We have 3 bonds, all 2x10Gbps, so it feels like somehow they were running at 2x1Gbps for whatever reason.

I didn't investigate for long, but a basic interface check showed that the bonds should have been running at 20Gbps.
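For anyone wanting to repeat that kind of basic interface check, here is a minimal Python sketch. It assumes the bond state is read from the Linux bonding driver's /proc/net/bonding/<bond> file; the sample text, interface names and helper names (`slave_speeds`, `bond_capacity_gbps`) are made up for illustration and are not taken from the affected gateway.

```python
import re

# Illustrative text in the format of /proc/net/bonding/<bond>.
# Interface names and values are invented for this example.
SAMPLE = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up

Slave Interface: eth1-01
MII Status: up
Speed: 10000 Mbps
Duplex: full

Slave Interface: eth1-02
MII Status: up
Speed: 10000 Mbps
Duplex: full
"""


def slave_speeds(text):
    """Return {slave_name: speed_in_mbps} for every slave section."""
    speeds = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"Slave Interface:\s*(\S+)", line)
        if match:
            current = match.group(1)
            continue
        # Lines such as "Speed: Unknown" do not match and are skipped.
        match = re.match(r"Speed:\s*(\d+)\s*Mbps", line)
        if match and current is not None:
            speeds[current] = int(match.group(1))
    return speeds


def bond_capacity_gbps(text):
    """Sum the negotiated slave speeds and convert Mbps to Gbps."""
    return sum(slave_speeds(text).values()) / 1000
```

A bond whose two slaves had negotiated 1000 Mbps would come out as 2.0 here, which is the pattern the observed cap resembles. In this thread the interface check reported the expected 20Gbps, so negotiated link speed was apparently not the cause.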

Jan_Kleinhans · ‎2021-06-10

Hello,

We have latency issues when browsing the web on T118. At the moment the workaround is to disable SecureXL on the VS.
A case is open.

Edit: We can narrow it down to clients where HTTPS inspection is happening.

Regards,

Jan

Kaspars_Zibarts · ‎2021-06-10

Hehe, with the VS running over 10Gbps, turning off SXL would be suicide 🙂

Jan_Kleinhans · ‎2021-06-10

Running at 2GBit/s with same CPU load as with SecureXL turned on doing URLF and IPS. Very funny.

Henrik_Noerr1 · ‎2021-06-11

We went from T91 -> T118 and are experiencing increased load on several VS fwk threads. In effect, many of our VSs have doubled in CPU usage or more. Case open. No blades except firewalling are enabled, btw.

/Henrik

Timothy_Hall · ‎2021-06-11

Take 100 that you upgraded through was supposed to have a fix that may be related to what you are seeing:

PRJ-15447, PMTR-55887 (VSX): In some scenarios, there may be high CPU utilization in a VSX environment with several instances.

Might be interesting to ask TAC to look specifically at this fix and whether it is working as intended in your environment.

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

Kaspars_Zibarts · ‎2021-06-11

Problem for us wasn't CPU I'm afraid but heavily reduced throughput. 2x1G instead of 2x10G I would say.

Timothy_Hall · ‎2021-06-11

Kaspars, my response about CPU was to Henrik, but it is strange that you seem to be capping right at 2Gbps like that. Are you able to determine what is going on when traffic is bumping against that limit? Packet loss? Latency? Jitter? I assume you don't have any CPUs hitting 100% utilization during this capping, and that network interface statistics look clean?


Kaspars_Zibarts · ‎2021-06-11

I'm afraid I didn't get much time to investigate. As soon as I realized that we had a problem, I reverted to the snapshot on the standby and was back within 15 minutes, as it is a fairly important production firewall. Interestingly, no one complained, so I assume we only "slowed down" traffic for roughly an hour, with no major noticeable impact. CPU was as usual on the VSes, but the virtual switches showed increased CPU. Apart from that I have no info to go on 😞 which is a shame.

Eitan_Gilad-Lug · ‎2021-06-14

Hello Kaspars,

If you have opened an SR for this issue, please share the number with me privately.

thanks

Eitan, VP Technical Services

Henrik_Noerr1 · ‎2021-06-16

So the load went back to normal after 48 hours. Apparently the cause was connections that were not accelerated after the upgrade but became accelerated again after several hours, I guess because new connections were established.
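One way to watch for this after an upgrade is to track the share of accelerated connections over time. The sketch below is only an illustration: it assumes a summary line of the form "Accelerated conns/Total conns : X/Y (Z%)", as printed by `fwaccel stats -s` on builds I have seen, and the exact wording may differ between versions. The sample numbers and the helper name `accelerated_ratio` are invented.

```python
import re

# Assumed format of the SecureXL summary line; the numbers are invented.
SAMPLE = "Accelerated conns/Total conns : 1200/4800 (25%)"


def accelerated_ratio(line):
    """Return accelerated/total as a float, or None if the line does not match."""
    match = re.search(r"Accelerated conns/Total conns\s*:\s*(\d+)/(\d+)", line)
    if match is None:
        return None
    accelerated, total = int(match.group(1)), int(match.group(2))
    # Avoid dividing by zero on an idle system.
    return accelerated / total if total else 0.0
```

Sampling this value in each VS context before and after the upgrade would show whether the ratio drops and then recovers as old connections age out.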

genisis__ · ‎2021-06-11

Hope this is not another bad Jumbo release!

jafara · ‎2021-06-14

Hi @Kaspars_Zibarts,

Thank you for the detailed information.
I am looking into the diff between T102 and T118 to identify whether there is any possibility of a degradation.

This may take a few days; I will keep you updated.

Regards,

Jafar Atili
VSX Core Team leader

genisis__ · ‎2021-06-19

Jafara,

Do you have any update for us? I'm pretty sure we all want these issues resolved, and in fact the QA on the jumbos to be especially scrutinized.

Kaspars_Zibarts · ‎2021-06-19

Hi! I made a new attempt at the T118 installation, this time with a small twist: I added an extra step. After the JHF install and node reboot, I pushed all topologies and policies. That seems to have done the trick: no more strange 2x1G throughput limitations.

In a nutshell:

1. Use CLI CPUSE to install T118 on the standby node, with a reboot at the end.
2. After the node has recovered, push all VS (including VS0) topologies and policies from SmartConsole.
3. Fail over and repeat the same steps on the other node.
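The order of operations above can be sketched as a small checklist generator. This is only a planning aid under stated assumptions: the node and VS names are hypothetical, the function `rolling_jhf_plan` is invented here, and it prints text only; it does not talk to CPUSE or SmartConsole.

```python
def rolling_jhf_plan(members, virtual_systems, take="T118"):
    """Build an ordered checklist for a rolling JHF install.

    members: cluster members in upgrade order; the first one is the
             current standby (hypothetical names).
    virtual_systems: every VS to push, including VS0.
    """
    steps = []
    for index, member in enumerate(members):
        if index > 0:
            # Every member after the first must become standby before its upgrade.
            steps.append(f"Fail over so that {member} becomes standby")
        steps.append(f"Install {take} on {member} via CLI CPUSE and reboot")
        for vs in virtual_systems:
            steps.append(f"Push topology and policy for {vs} from SmartConsole")
    return steps


if __name__ == "__main__":
    plan = rolling_jhf_plan(["node-b", "node-a"], ["VS0", "VS1", "VS2", "VS3"])
    for number, step in enumerate(plan, start=1):
        print(f"{number}. {step}")
```

Listing every VS explicitly makes it harder to forget VS0 or a virtual switch when pushing by hand.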

I will need to observe the actual T118 behaviour for a couple of days, but bandwidth looks OK now.

It rings a bell, as there was a similar issue with one of the takes back in R80.30, if I remember correctly, where you had to push policy during JHF installation or else nothing worked.

jafara · ‎2021-06-19

Hi @Kaspars_Zibarts ,

Thank you for your update,
I think it is always good to push policy after installing a newer code in the system.

Regarding the VSX configuration push, I can't think of how it could be related to limiting the firewall throughput.

Are we sure the issue we experienced (the throughput limit) is 100% a firewall issue? Could it be related to a 3rd-party system?

Thanks,
Jafar

Kaspars_Zibarts · ‎2021-06-20

Yes, I'm 99% sure as it was only FW that changed and we tried both nodes in the cluster and they are located in different datacentres and connected to different physical switches.

As for the JHF installation procedure: could you please confirm whether it is a CP recommendation to install policies after the first node has been upgraded and before cutting over to upgrade the other cluster member? That part normally works with JHF installations without the need for a policy install. If it is required, it really needs to be documented somewhere.

jafara · ‎2021-06-22

Hi @Kaspars_Zibarts ,

There's no official recommendation to push policy after JHF upgrade as the policy is being pulled from the Management in the next boot.

However, in some very rare scenarios a policy push can be helpful.

Regarding our case here, we'll take it offline and test it internally.

Thanks,
Jafar

JozkoMrkvicka · ‎2021-06-20

As Check Point upgrades are not always trouble-free, here are the steps we follow when installing a jumbo or doing a major upgrade:

1. Schedule a maintenance window with a potential service outage in case of disaster

2. Snapshot of both nodes, backup of both nodes. In case of VSX, also management backup, snapshot, export.

3. transfer all backups outside of the box

Steps on current standby member:

4. upgrade CPUSE deployment agent to newest version

5. import + verify + install (if verify passed) hotfix

6. Let the standby member reboot automatically

7. Once standby member is up and running as standby, do all needed healthchecks

8. If all HCs are fine, policy install on both members. Check warnings after policy installation for any suspicious messages

9. HC again

10. Wait 10 minutes and perform failover

11. Ask everyone to do all needed tests if all is running fine (latency, speed, ...)

12. Grace period of 1 week in case some issue will pop-up after XY minutes/hours/days

13. After all is fine with upgraded member, repeat steps 4 - 11 on second member

You have to be paranoid these days and do as much as possible to avoid service disruptions. If there is one, you can easily fail back while still being able to investigate the issue with TAC.

Installing the policy should be mentioned in every jumbo SK...

Kind regards,
Jozko Mrkvicka

genisis__ · ‎2021-06-20

I don't see why a policy installation is required. When the VSX node reboots it will pick up the policy from the manager anyway.

If this is required, in my opinion, this would be a flaw in the product; what if you have 30 VSs? Surely the vendor should not expect a policy push to all 30 VSs every time a jumbo or upgrade is done.

Kaspars_Zibarts · ‎2021-06-20

In principle I agree with you @genisis__ - it seems odd that a manual policy install is required. If that's the case, then it should be included in CPUSE; it is not that hard to code re-applying all policies after reboot. 🙂

In my case, we had only 4 VSes so it was worth the effort to try, and it seems to have paid off.

genisis__ · ‎2021-06-20

I've had to do something similar in the past, and with the amount of VS I have it took almost 2hrs.

CPRQ · ‎2021-06-23

Do we need to take the snapshot only on VS0? What are the commands, and how long does it normally take? Thanks

genisis__ · ‎2021-06-23

The snapshot covers the entire VSX installation, but yes, I would run it from VS0 in clish:

>add snapshot <name> desc "<Description>"

Note: dashes cannot be used in the name

to see progress:

>show snapshots

Once this is completed I would suggest this is exported and stored offline.
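Since dashes are rejected in the snapshot name, a small helper can build a safe name before you type the command. This is a minimal sketch: the function `snapshot_name` is invented for illustration, and treating letters, digits and underscores as the only safe characters is my own conservative assumption, not a documented Gaia rule.

```python
import re
from datetime import date


def snapshot_name(label, day):
    """Return a snapshot name without dashes, suffixed with the date.

    Every character other than a letter, digit or underscore is replaced
    by an underscore (a conservative assumption, see the note above).
    """
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", label)
    return f"{cleaned}_{day.strftime('%Y%m%d')}"


if __name__ == "__main__":
    name = snapshot_name("pre-T118", date(2021, 6, 19))
    # The resulting name can be pasted into the clish command quoted above.
    print(f'add snapshot {name} desc "Before JHF install"')
```

Putting the date in the name also makes it easier to tell exported snapshots apart once they are stored offline.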

JozkoMrkvicka · ‎2021-06-26

If you are doing a software upgrade (from an older version to a newer version, e.g. R80.30 to R80.40), the snapshot is taken automatically by the upgrade process itself, but it is stored locally on the upgraded VSX. If you want a snapshot that can be transferred off the box, you need to take one manually (syntax mentioned above).

When you install a Jumbo take, the snapshot is not taken automatically and must be taken manually.

You can do snapshot from any VS, but it will do snapshot for all VSs, not only for specific VS.

As most of config is on management and not on VSX itself, the best is to perform snapshot of management as well.

Kind regards,
Jozko Mrkvicka

Issues with throughput after VSX upgrade from R80.40 T102 to T118