Magnus-Holmberg
Advisor

VSX boot time

Hi,

Boot times are getting more and more extreme on large VSX systems.
On R80.30 (3.10 kernel) with 41 VS, we needed about 15 min (uptime of the node) for all VS to be started.
On R80.40 we now need 47 min to boot the same cluster nodes, and it takes about 30 min before the VS even start to come up.
Boot time of the actual physical box is excluded from these figures; the clock starts when you are able to log in via SSH.

Is there a plan to reduce this, or is the practical limit of VSX actually fewer than 25 VS per cluster?
Beyond 25 VS there are a lot of pain points.

Regards,
Magnus

https://www.youtube.com/c/MagnusHolmberg-NetSec
Chris_Atkinson
Employee

For context, what's the underlying system/hardware config: XFS, SSD, etc.?

CCSM R77/R80/ELITE
Magnus-Holmberg
Advisor

Open Server, HPE DL360 G10
2 x Xeon Gold 5122, 192 GB RAM
2 x 480 GB SSD, RAID 1

Clean-install upgrade process, less than 500 Mbit/s of traffic, <50,000 connections, <10% CPU load
R80.40, HFA173

Chris_Atkinson
Employee

@Magnus-Holmberg What is your current experience like, has there been any noticeable change?

Henrik_Noerr1
Advisor

We have the exact same issue. Boot times are more than an hour, which is really hurting our patch cycles and general trust when downtime occurs.

When the clusters become available for SSH login, no Virtual Systems are online for maybe 30 minutes.

After this, the Virtual Systems are loaded sequentially for another 30 minutes.

So in total we need more than 1 h to boot a node. We see no performance-related issues with top, iotop or other debugging tools.

Actually, the box is idle until policies start loading. We do see that the process 'cgroup' is intermittently using some CPU.

We mostly use Lenovo 850p servers with 32 cores and 256 GB RAM, running a mix of R80.40 and R81.10.

Not sure why I have not created a TAC case already. I will do that.

Magnus-Holmberg
Advisor

It has improved a lot in later HFAs on R81.10 and R80.40 for our smaller clusters with <20 VS: more than a 50% reduction in boot time.
We are going to upgrade the large clusters within a few weeks to see how it behaves with 50 VS on them.

Magnus-Holmberg
Advisor

Upgrading a VSX with 42 VS to R81.10 now, with a clean install plus HFA129.
The VSX reconfigure takes 1 hour; on R80.30 3.10 the same would take ~15 min.

VSX_prov.jpg

Magnus-Holmberg
Advisor

Regarding restarting a VSX node:
it's 47 min before the VS begin starting, and an additional ~25 min for all the VS to start.
So I would say it's a big problem.

VS_starting.jpg

Magnus-Holmberg
Advisor

Tested R81.10 HFA132, and the box was booted with all VS started within 15 min.
But something is wrong with the CPU here: without traffic it is 100% loaded, although the load drops after 20 min.
The box that carries the traffic (the non-upgraded member) is at less than 18%.

VSX_CPU_LOADED.jpg

Chris_Atkinson
Employee

We see PRJ-45520 incorporated into the recent JHFs.

Did the CPU trend continue, or have you rolled it back?
Magnus-Holmberg
Advisor

Reverted, because everything takes too long to fit in a 6-hour service window.

Having boxes take 1 h to boot doesn't work, because it leaves no time for troubleshooting.
So we put on HFA versions that we know work from before in smaller clusters.
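
To put rough numbers on the service-window problem, here is a minimal sketch of the arithmetic. The function name and the troubleshooting reserve are hypothetical, for illustration only; the 6-hour window and the boot times (about 15 min, about 1 h, and 47 + 25 = 72 min) are the figures mentioned in this thread.

```python
def reboots_that_fit(window_minutes, boot_minutes, reserve_minutes=0):
    """Count how many full node boots fit in a service window.

    reserve_minutes is a hypothetical block of time held back for
    troubleshooting; it is not a figure from the thread.
    """
    if boot_minutes <= 0:
        raise ValueError("boot_minutes must be positive")
    usable = window_minutes - reserve_minutes
    return max(0, usable // boot_minutes)


window = 6 * 60  # 6-hour service window, in minutes

print(reboots_that_fit(window, 15))       # boot time of ~15 min
print(reboots_that_fit(window, 60))       # boot time of ~1 h
print(reboots_that_fit(window, 47 + 25))  # 47 min wait plus ~25 min for the VS
print(reboots_that_fit(window, 60, reserve_minutes=120))  # with 2 h held back
```

With two cluster members that each need at least one reboot, and any rollback costing further reboots, a 1 h boot leaves very little slack compared to a 15 min boot.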

vbrozik
Explorer

If you have a high number of network interfaces and the MAC address of some of them starts with a letter (a-f, not 0-9), then it might be a bug we identified.

We encountered the problem on a system with about 400 VLAN interfaces. There the bug delays the start of the virtual systems after the OS boot by about 40 minutes! All this additional time is spent by the process fwaffinity_apply on repeated nonsensical calls of ctl affinity (each taking about 6 seconds).

At the moment the fix is a private HF, PRHF-34015, on top of R81.10 JHF 130, and a HF on top of JHF 150 is currently in preparation.
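
For anyone who wants to check whether their system matches the condition described above, here is a minimal sketch. It reads MAC addresses from the standard Linux sysfs path /sys/class/net/<interface>/address and lists the interfaces whose MAC starts with a-f. The function names are hypothetical, and the delay estimate assumes roughly one slow call per interface, which is only a guess that happens to match the numbers above (400 x 6 s = 40 min). It does not detect the bug itself; it only shows whether the precondition is present.

```python
import re
from pathlib import Path

# First character of the MAC is a hex letter (a-f) rather than a digit.
HEX_LETTER_FIRST = re.compile(r"^[a-f]", re.IGNORECASE)


def read_macs(sysfs_root="/sys/class/net"):
    """Return {interface_name: mac} read from Linux sysfs.

    Returns an empty dict if the path does not exist; entries without
    a readable 'address' file are skipped.
    """
    macs = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return macs
    for iface in sorted(root.iterdir()):
        try:
            macs[iface.name] = (iface / "address").read_text().strip()
        except OSError:
            continue
    return macs


def letter_first_macs(macs):
    """Keep only interfaces whose MAC address starts with a-f."""
    return {name: mac for name, mac in macs.items()
            if HEX_LETTER_FIRST.match(mac)}


def estimated_delay_minutes(call_count, seconds_per_call=6.0):
    """Rough delay estimate; one ~6 s call per counted item is an assumption."""
    return call_count * seconds_per_call / 60.0


if __name__ == "__main__":
    all_macs = read_macs()
    suspects = letter_first_macs(all_macs)
    print(f"{len(all_macs)} interfaces, {len(suspects)} with a MAC starting a-f")
    for name, mac in suspects.items():
        print(f"  {name}  {mac}")
```

If the list is long on a node with hundreds of VLAN interfaces, it may be worth raising the hotfix mentioned above with TAC.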

Henrik_Noerr1
Advisor

Thank you vbrozik! I was really chasing this.

Will ask our Diamond engineer where we are with R81.20 integration.

 

/Henrik

PhoneBoy
Admin

Boot time is one of those things that should improve with VSNext and R82.

Magnus-Holmberg
Advisor

Well, this is not normal behavior, however, as it has changed between HFA versions.
VSNext also requires a clean install.

PhoneBoy
Admin

Among other things, yes 🙂
Having said that, if this behavior significantly changed between JHF takes, that definitely warrants a TAC investigation.
