26000 appliance - first impressions after upgrade from 41k chassis

Kaspars_Zibarts · ‎2020-04-22

This past weekend we retired our "jumbo jets" (aka 41k chassis) in favour of 26k appliances running R80.30 Take 155 with the 3.10 kernel.

The main drivers were the limitations of the Scalable Platform (SP):

- we wanted to run a virtual router in our VSX, and that's not supported on SP
- we were not able to upgrade to the R80.xSP train because of multiple limitations, so on R76SP50 we:
  - were lacking FQDN objects
  - were lacking Updatable objects
  - had IA performance issues / CPU usage
  - had 32-bit kernel limitations
- troubleshooting cases involving flow corrections between SGM blades proved cumbersome, and tweaking a box took a considerable number of hours in our environment. To give one example: when certain TCP RST conditions were met, a connection got "stuck" on the corrected SGM and was never released. This ended up consuming all high ports for a particular client's connections and eventually stopped traffic from that host (it was a load balancer representing many clients, so high ports were frequently reused)
- most importantly, the lack of available administrators able to maintain the SP platform: we could not find anyone to replace the admins who left our team. This created a major threat to our contingency plans

So the decision was made to move back to appliances. At the time we were seeing total backplane traffic in the chassis, across all SGMs, approaching 40Gbps. So we had no choice but to get the 26000, which promised 100Gbps of pure FW throughput.

Well. I guess after all these years working with CP, you learn to take all the numbers with a pinch of salt.

What's the reality after running it for a couple of days? I personally don't think we will be able to push it beyond 60Gbps.

Why? The red line in the screenshot below marks the point in time when:

- total throughput was just under 20Gbps
- total concurrent connections were just over 1M
- total connection rate was 30k cps
- acceleration across all VSes was > 90%
- only the FW and IA blades were in use

If I take the average CPU usage across all 72 cores, it works out to approx. 30%, so I could "guesstimate" that we could triple the performance of the box before it maxes out in our environment.
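The "guesstimate" above is a straight linear extrapolation, and it can be sketched in a few lines of Python. This is only a rough illustration: it assumes throughput scales linearly with average CPU usage, which is unlikely to hold exactly on a real box, and it rounds "just under 20Gbps" up to 20. The function name and the `cpu_ceiling_pct` parameter are made up for this example.

```python
def estimate_max_throughput_gbps(observed_gbps, avg_cpu_pct, cpu_ceiling_pct=100.0):
    """Scale observed throughput up to a CPU ceiling, assuming linear scaling."""
    if not 0 < avg_cpu_pct <= 100:
        raise ValueError("avg_cpu_pct must be in the range (0, 100]")
    if not 0 < cpu_ceiling_pct <= 100:
        raise ValueError("cpu_ceiling_pct must be in the range (0, 100]")
    return observed_gbps * cpu_ceiling_pct / avg_cpu_pct


# Figures from the post: roughly 20Gbps at roughly 30% average CPU.
theoretical = estimate_max_throughput_gbps(20.0, 30.0)

# Hypothetical 90% ceiling, leaving some headroom instead of running flat out.
with_headroom = estimate_max_throughput_gbps(20.0, 30.0, cpu_ceiling_pct=90.0)

print(f"at 100% CPU: {theoretical:.1f} Gbps")
print(f"at 90% CPU:  {with_headroom:.1f} Gbps")
```

With a little headroom left on the CPUs, the linear estimate lands on the same 60Gbps figure as "triple the performance".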

Was it worth the upgrade, given that the 41k would easily have done the same job performance-wise, or even better?

I have to admit that the (lack of) SP platform admin availability, plus the gain of FQDN and updatable objects and Virtual Router, justified it for us. It might be short-lived due to capacity limitations, but it puts us in the right direction. I'm glad to be moving away from SP. I don't think it's mature enough, just as VSX as a product wasn't 7 years ago, before reaching R77.30.

I would love to hear from those using VSX on open server HW and running R80.30 - how's the performance on those? I have heard that open servers outperform appliances by a huge margin thanks to faster CPUs.

Kaspars_Zibarts · ‎2020-05-06

Quick update on this topic.

After a chain of unfortunate incidents I managed to discover that a large portion of the "average" CPU usage on most VSes was attributable to PDPD, i.e. Identity Awareness. This was the first platform where I decided not to split the IA processes out to dedicated cores, and it backfired...

To keep it short: PDPD was running at 90%+ CPU load on VSes that had IA enabled, and since it was allowed to run on all cores associated with a particular VS, it drifted around and created the artificial impression that the firewall was running at higher CPU than expected.
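A rough way to picture the effect is the Python sketch below. It assumes a single busy daemon whose CPU time ends up spread evenly across all cores of a VS over the measurement interval. The 90% figure is the PDPD load from the post. The 10% firewall load and the 4 cores per VS are purely hypothetical numbers chosen for illustration, as is the function itself.

```python
def apparent_core_load(fw_pct, daemon_pct, cores, pinned=False):
    """Average load (%) seen on each firewall core of one VS.

    fw_pct     - load the firewall itself puts on every core
    daemon_pct - load of one busy daemon, as a percentage of a single core
    cores      - number of cores the VS is allowed to run on
    pinned     - True if the daemon runs on its own dedicated core instead
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if pinned:
        # The daemon no longer touches the firewall cores at all.
        return fw_pct
    # The drifting daemon adds an equal share of its load to every core.
    return fw_pct + daemon_pct / cores


# Hypothetical VS: 10% real firewall load on 4 cores, PDPD at 90% of one core.
drifting = apparent_core_load(10.0, 90.0, 4)
isolated = apparent_core_load(10.0, 90.0, 4, pinned=True)

print(f"daemon drifting: {drifting:.1f}% per core")
print(f"daemon pinned:   {isolated:.1f}% per core")
```

Under those assumptions the per-core average more than triples while the daemon drifts, even though the firewall's own load has not changed.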

So a few steps were taken to address this:

- assign PDPD to run on dedicated CPU cores so we can isolate firewalling from IA
- disable nested groups (pdp nested_groups disable) - that was actually the main culprit for the high PDPD CPU. With an event rate of over 2000/min arriving from IDC and nested groups enabled, PDPD ran at nearly 100%. After disabling it, and with 7000 events/min arriving from IDC, PDPD was at a comfortable 25% CPU load
- disable the full role update after policy push (pdp __reconf_update_all disable) - again, in our environment it pushed PDPD to 100% for over 10 minutes after policy installation

After all the fixes above, I can say that overall Gen2 FW performance will meet Check Point's published figures (100Gbps) 🙂 Or at least come close to them.

Timothy_Hall · ‎2020-05-07

Thanks for the update Kaspars, sounds like a use case for the Identity Collector, depending on the size of the domain. With a firewall operating in kernel mode (i.e. non-VSX and non-USFW), processes like pdpd would always lose to the kernel when it came to CPU access, and it was easy to distinguish us vs. sy/si CPU execution time. But with USFW enabled (and of course when running in VSX mode), they are now on a more equal footing with regard to CPU access in "us" mode/process space, thus causing the misleading situation you observed.

One question: Are the fwk's and pdpd daemons running with standard, equal CPU scheduling priority on your system? Or is one of them nice'd down to a lower priority?


