HeikoAnkenbrand
MVP Gold

Performance Limitations of Virtual Switches in a VSX Environment

I’ve noticed that in a VSX environment the virtual switches don’t seem to achieve very high packet throughput.
In practice, I can’t get more than around 4–5 Gbps over a wrp interface. When I connect the same setup using physical switches with 100 Gbps transceivers and enable Multi-Queue, I can reach about 80 Gbps on a 100 Gbps interface.

The setup I tested consists of two scenarios, each with two 29000 appliances running VSX LS:
In the first one, Virtual System 1 is connected to Virtual System 2 through a virtual switch within the VSX environment.
In the second scenario, Virtual System 1 and Virtual System 2 are connected through a physical switch.

This allows me to directly compare the performance of traffic flows when using virtual switching versus physical switching.
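
For context, a comparison like this can be driven with any multi-stream traffic generator; a minimal sketch (iperf3, the host placeholder, and the interface name are examples, not necessarily my exact setup):

    # On a test host behind Virtual System 2: start the receiver
    iperf3 -s

    # On a test host behind Virtual System 1: 16 parallel TCP streams for 60 seconds
    iperf3 -c <host-behind-VS2> -P 16 -t 60

    # On the gateway, for the physical-interface scenario, check how many combined
    # queues the 100G NIC exposes and currently uses (standard Linux ethtool)
    ethtool -l eth1-01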

So my questions are:

  • What are the limitations of virtual switches in VSX regarding throughput?

  • Are virtual switches and wrp (Warp Link) interfaces capable of using multiple cores for packet forwarding?

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips

34 Replies
Danny
MVP Gold

This post describes a similar experience with inter-VS traffic utilizing virtual switches and wrp interfaces.

The virtual switch path, processed by the VSX kernel, doesn't provide the same performance as the physical switch path.

Did you verify CPU usage for FWK during your tests with this oneliner?
The fwk_dev process, responsible for virtual switch traffic, can spike above 100% CPU usage, indicating single-core saturation. It might also not offload to SecureXL’s Fast Path or Medium Path and is subject to user-space inspection, increasing latency and CPU usage. Traffic routed internally between VSs often falls into the F2F path, which is the slowest. @Timothy_Hall might be able to elaborate on this further.

Tools like mq_mng, cpview, and fw ctl affinity offer limited visibility and control over virtual switch performance.

In a VSX setup, virtual systems share the same physical resources and compete for CPU and memory bandwidth, which adds latency and reduces overall throughput.

Multi-Queue is supported in VSX environments, but its effectiveness depends on the NIC driver and its queue capabilities, the number of available SND cores, and whether Multi-Queue is configured in VS0 when interfaces are shared across VSs.

Summary:

Feature            | Physical Interface | Virtual Switch (wrp)
SecureXL Fast Path | Fully supported    | Not applied
Multi-Queue        | Supported          | Not applicable
CoreXL Scaling     | Multi-core         | ⚠️ Limited
Throughput         | High               | ⚠️ Limited
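
As a rough way to confirm the single-core pattern during a test (a sketch; the VS ID 5 and the fwk5_dev process name are just examples, and naming can vary by version):

    # Find the fwk process serving the Virtual Switch / VS under test
    ps -ef | grep fwk

    # Watch its threads live; one thread pinned near 100% points to single-core saturation
    top -H -p $(pgrep -f fwk5_dev | head -n 1)

    # Check how many CoreXL instances exist in that context
    vsenv 5
    fw ctl multik stat
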
HeikoAnkenbrand
MVP Gold

I understand the facts and they are clear to me.

My question is about the throughput you can realistically plan for when using WRP interfaces. In classic 1G environments it was never an issue to work with virtual switches, but the situation changes when dealing with 10G/25G/40G/100G.

Up to which throughput does Check Point recommend using virtual switches with WRP interfaces, and at what point should one switch to dedicated physical switching hardware?

Are there any recommendations from Check Point here?

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips
Timothy_Hall
MVP Gold

I've never really dug into the networking guts of VSX, but the fact that you are topping out at 4–5 Gbps may be significant: back when 10 Gbps interfaces first hit the scene, total throughput would max out at about 5 Gbps if Multi-Queue was not available or not enabled for the interface, because the single SND core handling it hit 100%. It sounds to me like a single thread in fwk_dev is hitting that same single-core limit and simply can't go any faster.

Supposedly, the warp jump between VSs will accelerate inter-VS traffic as much as possible, but there may well be limitations that cause F2F handling, which obviously will not help the CPU. There have also been cases in the past where the Virtual Switch improperly floods all inter-VS traffic to all VSs instead of sending it only to the right one; those issues were supposedly fixed a while back but may have reared their ugly head again now that UPPAK is on the scene, which I believe is enabled by default on your Quantum Force 29000: sk175113: Traffic latency when it passes through a Virtual Switch (VSW)
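
To see whether the inter-VS traffic is actually being accelerated or is dropping to F2F, the per-VS SecureXL counters are a quick check (a sketch; run from the context of the relevant VS):

    vsenv <VS-ID>
    fwaccel stat        # acceleration status for this VS context
    fwaccel stats -s    # summary counters, accelerated vs. F2F packets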

Other kernel variables that might be significant are cphwd_routing_interval, enable_calc_route_wrp_jump, and sim_warp_jump_strict_mac.
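
For anyone who wants to inspect those, fw kernel parameters can usually be read on the fly, while sim_* parameters belong to SecureXL and are handled separately (a sketch; exact module placement can vary by version, and values should only be changed under TAC guidance):

    # Read current values of fw kernel parameters
    fw ctl get int cphwd_routing_interval
    fw ctl get int enable_calc_route_wrp_jump

    # sim_* parameters are SecureXL (sim) settings; persistent values typically go into
    # $PPKDIR/boot/modules/simkern.conf (fw parameters into $FWDIR/boot/modules/fwkern.conf), e.g.:
    #   sim_warp_jump_strict_mac=1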

If you are in UPPAK mode on your 29000, it might be interesting to revert to KPPAK mode and see what happens with your VSwitch performance.

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course
HeikoAnkenbrand
MVP Gold

@Timothy_Hall

First of all, thank you all for sharing your insights. I had already tested most of the points mentioned earlier, and they confirm what I had suspected.

It seems that the WRP interface, as well as the virtual switches, are limited to running on a single CPU core. As a result, throughput rates of around 4–5 Gbps per virtual switch instance appear to be normal under these conditions.

The challenge I am facing is that I am running two virtual systems, each of which makes use of 32 CPU cores. When using physical interfaces, I can achieve throughput rates of approximately 80 Gbps without major issues.

However, in order to connect these two virtual systems with a dedicated transfer network, I introduced a virtual switch. For this connection, I require a throughput in the range of 50–60 Gbps. Unfortunately, with virtual switches, such throughput is far out of reach given the current architecture and CPU core limitations.

This discrepancy highlights a significant bottleneck in scenarios where virtualized infrastructures are expected to handle very high data transfer rates. Addressing this limitation—either through multi-core support in virtual switching or alternative approaches to interconnectivity—will be crucial for enabling high-performance virtualized networking environments in the future.

Maybe I am overlooking something in this design, which is why I am not achieving higher throughput rates.

@_Val_ @PhoneBoy

Perhaps the R&D team at Check Point could provide some insights on this.

The question is:

What maximum throughput can virtual switches and WRP interfaces handle in VSX?

Otherwise, I will open a ticket on the topic as an alternative.

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips
Henrik_Noerr1
Advisor

We are troubleshooting performance issues with video-streaming UDP flows and are currently focusing on virtual switches, partly due to this thread 🙂

During your tests, do you get a saturated core, or does the throughput simply never 'pick up'?

We have VSw <> VS <> VSw with very poor performance, but have yet to narrow down the issue.

We do not see saturation on the fwk threads, the SNDs, or any single CPU core for that matter.
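
For reference, a rough way to see the core-to-role mapping and the live per-core load during such a test (just a sketch):

    # Which cores are SNDs and which fwk instances / VSs run on which cores (run from VS0)
    fw ctl affinity -l -r

    # Live per-core utilisation while the test runs: in top, press '1' for per-CPU lines
    # and 'H' to show threads; cpview also keeps a per-core CPU history
    top
    cpview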

R81.20 Take 99, VSX, Open Server

/Henrik

PhoneBoy
Admin

What version/JHF level?

HeikoAnkenbrand
MVP Gold

@PhoneBoy 

We use R81.20 JHF 105.

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips
genisis__
MVP Silver

I've seen similar throughput observations with VSwitches. When switching into the VSW context at the CLI and running fw ctl multik stat, I noted one kernel instance. So if there is a way to assign multiple kernel instances, this would likely help with throughput (not 100% sure on this, but it would make sense).
Also, I'm not sure if this also applies to R82.x.

PhoneBoy
Admin

Given this is most likely in Legacy VSX code, I doubt R82 will address the issue.
I suspect this will require VSNext.

Christian_Froem
Participant

As a large university hospital, we also run a VSwitch environment. It's very convenient that we don't have to deal with a multitude of physical interfaces to route traffic between VSX instances, but can use VSwitches instead.

However, we have users (mainly scientists) who occasionally complain that the data rates of their large transfers are poor. This seemed strange in our 10 Gbit firewall environment. But your post prompted me to take some action, because we have the same problem: we never achieve more than ~4.7 Gbit/s via the VSwitch.

Therefore, we are also curious whether this limitation can be changed in some way or whether we need to reconsider the use of VSwitches and switch to physical interfaces instead 😕

(R81.20, Take 105, Check Point 7000)

Wolfgang
MVP Gold

@HeikoAnkenbrand did you have a chance to do some tests with Maestro Fast Forward enabled for the connections? I know that's not a solution, but it would be good to know whether it allows more throughput.

HeikoAnkenbrand
MVP Gold

Thanks @Wolfgang for your tip!

I also make frequent use of Fast Forward ("R81.20 - Performance Tuning Tip - Maestro Fastforward") in Maestro environments, and it is helpful.

The main limitation I see is the throughput of the VSX virtual switches. In all environments I’ve tested, this is around 4–5 Gbps, which becomes a real challenge for very high-performance virtual systems in the 100 Gbps range.

@PhoneBoy:

The question is:

What is Check Point’s recommendation here?

Maybe R&D can provide some guidance on this.

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips
genisis__
MVP Silver

Do we know when R&D will respond?

Wolfgang
MVP Gold

A statement from R&D would be very informative.

Chris_Atkinson
MVP Gold CHKP

Short of testing R82 / VSNext, I envisage such a statement will only come via a formal solution center request (via SE / local office).

I will ask internally and share what I learn but this is not a replacement for the correct process.

CCSM R77/R80/ELITE
HeikoAnkenbrand
MVP Gold

Don't be angry with me, but this is something that belongs in the manual. When I'm planning an environment for a customer, I can't ask the local SE questions and wait a long time for answers.

Could you please provide information on the throughput rates we can expect with virtual switches?

Preferably a statement for classic VSX and VSNext.

For classic VSX I currently assume approx. 4–5 Gbps. Anything beyond that I would build at the customer's site with dedicated physical interfaces, for example with 100 Gbps transceivers.

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips
PhoneBoy
Admin

In internal testing (and likely with the most recent code), we got much higher rates than the 4–5 Gbps you are getting.
That suggests a TAC case is in order.

genisis__
MVP Silver

Are there any actual published throughput rates for a virtual switch? I'm guessing the rates would vary depending on the appliance or open server used.

In the internal testing, what was the setup and what were the results?

PhoneBoy
Admin

It also depends greatly on the code level.
In any case, for performance details outside of published specs, you will need to consult with your local Check Point office.

Bob_Zimmerman
MVP Gold

The "most recent code" is likely the key here. Linux's network kernel has never been especially high-performance, and namespace-to-namespace performance was pretty low for a long, long time. It's only in recent kernel versions that it has become acceptable for even 10G throughput. I know the warp driver isn't Linux kernel code, but it's subject to a lot of the limitations of the kernel.

Incidentally, this is a big part of why userspace networking is such a big deal on Linux, but nobody in the illumos/BSD communities really cares. FreeBSD in particular has had a dramatically faster network kernel for a long time. illumos lagged behind it for a while (still far ahead of Linux), but Crossbow pushed illumos' capabilities ahead and the performance improved to about 80% of FreeBSD's at the same time.

PhoneBoy
Admin

There's a good reason IPSO was based on FreeBSD 🙂

HeikoAnkenbrand
MVP Gold

What throughput rates did you achieve in the lab? I would like to have this information in order to plan the correct sizing for customer projects 😉

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips
PhoneBoy
Admin

Without the hardware and software used to generate such numbers, not sure how useful they are.
Best to consult with the local Check Point office.

Gera_Dorfman
Employee

There should be no performance degradation. The issue is specific to UPPAK and has been identified as a bug. The fix is targeted for inclusion in one of the upcoming JHF releases. If an urgent fix is required, please open an SR, and we'll work on a dedicated hotfix.

Timothy_Hall
MVP Gold

Thanks for the clarification, Gera. I was speculating earlier in the thread that it might be related to UPPAK.

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course
HeikoAnkenbrand
MVP Gold

Thanks, @Gera_Dorfman  — good to hear that the issue will be fixed in the next JHFs.

Let me come back to the original question:
What is the maximum throughput in Gbps that can be achieved in the ideal case once the bug is fixed?

In practice, we’re seeing around 5 Gbps compared to your lab tests, which reach about 70 Gbps.

➜ CCSM Elite, CCME, CCTE ➜ www.checkpoint.tips
genisis__
MVP Silver

(for Gera)

Can we also get the bug ID for this, which should have an SK linked?

Is it listed on the Jumbo HFA page as well?

Just for clarity: are you saying that the performance observation (throughput does not realistically go above 4–5 Gbps) when using a virtual switch is attributed to a bug? If so, this must be an old bug, and I've never seen it defined as a bug before.
Sorry to ask the above, but I think it's important we understand a bit more here.

Gera_Dorfman
Employee

@genisis__ @HeikoAnkenbrand 

Answering all questions in one post 

- The virtual switch should not impact performance at all. If you achieve 100G without the virtual switch, you should expect the same performance with the virtual switch. The current degradation issue lies in the UPPAK implementation of the virtual switch, and we need to fix it.


- The fix will be included in one of the upcoming Jumbo releases, but not necessarily the next one. If an urgent fix is needed, please contact support, and we'll arrange a hotfix.

- The bug to track is PMTR-119674

Hope this provides more clarity now.

Gera 

Jens_Groth_Andr
Participant

Thank you @Gera_Dorfman. This does provide some clarity, especially with the issue in the UPPAK implementation.

However, it's not clear to me if a config change to KPPAK was tried in the setup behind @HeikoAnkenbrand's original post. It is also not clear to me whether the observations reported by @genisis__ were with UPPAK or KPPAK.

For the more detailed report from @Christian_Froem, I suppose the setup (a Check Point 7000) is KPPAK.
We see indications of a similar VSW bottleneck of approx. 5 to 6 Gbit/s on our Check Point 28000 with R81.10 Take 174.

We hope to be able to do comparison tests with/without VSW and perhaps other tweaks in the near future.

This also brings our interest to a question by @genisis__ which I believe was not answered (sorry @Timothy_Hall and others if I missed something):
    When switching into the VSW at the cli and running fw ctl multik stat I noted 1 kernel instance. 
    So if there is a way to assign multiple kernels this would likely help with throughput ...

 
Any advice or pointers are highly appreciated. (Our upcoming tests will be on R81.10 take 181.)
