This is just a heads-up if you are running VSX with more than 4 SND/MQ cores and fairly high traffic volumes.
Please do not upgrade to R80.40 with a jumbo take lower than 100 (the fix is already included there), or request a portfix.
We upgraded from R80.30 take 219 to R80.40 take 94 and faced multiple issues.
I don't want to go into the root-cause details, but let's say there is a miscalculation of the resources required to handle information passed between the SND and FWK workers. That's corrected in JHF take 100 and above.
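For anyone unsure which jumbo take a gateway is running before deciding whether it is exposed, a quick check from expert mode (a sketch; the exact output format varies by version):

# List installed hotfixes, including the jumbo take:
cpinfo -y all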
I just wanted to say that Check Point R&D was outstanding today, jumping on the case and working until it was resolved! Really impressed; I wish I could name names here, but that would be inappropriate.
Thank you for the information and the heads-up.
I just wanted to update you: PRJ-15447 has already been released, and it is part of our latest ongoing jumbo (take 100).
Matan.
@MatanYanay not such great news this morning: the virtual router CPU overload is back, including packet loss. You can see it has been growing gradually since the fix. I have failed over to the standby VSX member and it seems to be holding for now. Need your top guns back! 🙂
Keep us updated @Kaspars_Zibarts 🙂 We have 4 large VSX clusters on open servers running R80.40 take 91.
I can't say we experience massive drops, but I am always suspicious. The load is unexpectedly high, and R80.40 has in general been an uphill struggle.
/Henrik
Are you seeing high CPU utilisation compared to relatively low throughput? If so, that matches what I see.
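A quick way to compare the two on a VSX member, using standard Gaia tools (a sketch; the VS ID is hypothetical):

# Switch to the context of the VS you want to inspect:
vsenv 0
# The cpview Overview pane shows per-core CPU next to total throughput:
cpview
# Check how much of the traffic is actually accelerated:
fwaccel stats -s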
@MatanYanay - PRJ-15447 is not mentioned in the JHF release notes. Can you please elaborate?
/Henrik
R&D confirmed that it is included, @Henrik_Noerr1.
Quick check:
fw ctl get int fwmultik_gconn_segments_num
This should return the number of SND cores. When unpatched, it will return 4.
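To compare that value against the actual SND core count on the box, something like this should work from expert mode (a sketch; output formats differ slightly between versions):

# CPU-by-CPU affinity listing; the cores handling interfaces are the SND/MQ cores:
fw ctl affinity -l -r
# The kernel parameter that should match the SND core count once patched:
fw ctl get int fwmultik_gconn_segments_num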
In all honesty, we have two other VSX clusters running VSLS at approximately 10 Gbps, and we do not see issues there.
The difference I see is that this particular VSX cluster is HA instead of VSLS, and we have a VR (virtual router) interconnecting nearly all VSes and carrying a fairly high volume of traffic.
Yeah, we get 4 on all nodes.
UserCenter unsurprisingly yields no results for fwmultik_gconn_segments_num.
Any info? I'm feeling too lazy to create an SR just before my Easter vacation 🙂
/Henrik
Indeed, this is one of those settings buried deep in the system configuration. I just know the required values. If any of the R&D guys want to expand on it, I'll let them do it; I don't want to spread incorrect or forbidden info! 🙂
I just ran this on my system and got the following:
# fw ctl get int fwmultik_gconn_segments_num
fwmultik_gconn_segments_num = 4
Implemented JHFA102 and the above parameter now reports correctly; strangely, I had to reboot the node twice.
The SNMPv3 user also seemed to get corrupted: I had to delete it and add it back, or SNMP details would not be discovered.
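For reference, deleting and re-adding the USM user in Gaia clish looks roughly like this (the user name and pass-phrases are hypothetical; double-check the exact syntax against the Gaia admin guide for your version):

delete snmp usm user snmpv3user
add snmp usm user snmpv3user security-level authPriv auth-pass-phrase MyAuthPass privacy-pass-phrase MyPrivPass
save config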
In /var/log/messages I have been seeing a lot of the following since installing the jumbo:
kernel: dst_release: dst:xxxxxxxxxxxx refcnt:-xx
The above is explained in sk166363, which implies it can be safely ignored. That said, I'm still going to ask TAC about it to cover myself.
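If you want to gauge how often the message is being logged before raising it with TAC, a trivial check from expert mode:

# Count occurrences in the current log:
grep -c 'dst_release' /var/log/messages
# Watch for new occurrences in real time:
tail -f /var/log/messages | grep dst_release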
So far I also believe the CPU usage pattern has changed for the better, but I will be able to tell tomorrow during a working day.
We rolled back ours after a two-week struggle with CPU issues. Some of them later transpired to be unrelated to the upgrade, so another attempt is being planned. I still suspect our VSX setup: HA with a virtual router, which is different from the other VSX clusters we have (VSLS, no VR). Plus the load: 30 Gbps vs 10 Gbps.
It's interesting, because in my case I suspect a VSW issue and in yours a VR issue; both would use wrp links and then a single physical interface leading to the external network.
My other two VSX clusters both have a VSW and they work just fine 🙂. I suspect the combination of the HA cluster type + VR 🙂
More importantly, the SK should be updated to reflect everything fixed in the jumbo.
I wonder if this is also playing into my long-standing performance issue! I know we have discussed this in another thread.
I also found that cppcap is broken in R80.40 with JHFA91 and fixed in JHFA100 (or you can get an updated RPM). I also suspect tcpdump is not quite working.
I found that when running either of these I seem to see only one-way traffic. For example, if I ping from a workstation the ping works, but tcpdump or cppcap show only the echo-request traffic; the echo-reply is never seen (for tcpdump, SecureXL was turned off; cppcap does not need SecureXL turned off).
I think Check Point needs to slow down with their releases and really focus on reducing the bugs. This would help all of us, including TAC!
The inability to see two-way traffic with tcpdump/cppcap in VSX may be due to a little-known SecureXL feature called "warp jump"; this was mentioned in my Max Capture video series and is also covered here:
sk167462: Tcpdump / CPpcap do not show incoming packets on Virtual Switch's Wrp interface
fw monitor -F should be able to successfully capture this traffic.
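For example, to capture both directions of a ping through a wrp interface, including accelerated packets, something like this should work (the IP addresses are hypothetical; the five -F fields are src,sport,dst,dport,proto, with 0 acting as a wildcard):

# Capture echo-request and echo-reply in both directions (protocol 1 = ICMP):
fw monitor -F "10.1.1.10,0,10.2.2.20,0,1" -F "10.2.2.20,0,10.1.1.10,0,1"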
Thanks Tim.
I do have a TAC case raised for this and have provided examples for both tcpdump and cppcap, but it does sound like the SK you mentioned.
Whilst you're here @Timothy_Hall 🙂
We had a very bizarre case of NFSD-TCP (2049) connections not being accelerated: traffic went partially via F2F and partially via PXL. After adding a "plain" TCP port service instead of the Check Point pre-defined one, acceleration kicked in. Until then, 3 out of 20 FWK workers were running flat out as 7 servers generated 5 Gbps towards a single destination server. I have never seen this before with port 2049, and R&D are equally puzzled. We actually rolled back to R80.30 because we suspected the upgrade was the root cause, but it turned out that some application changes had been made the same night as our upgrade, which started a totally new flow to port 2049.
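A quick way to see how traffic splits across the paths and whether individual FWK workers are saturated (a sketch; the VS ID is hypothetical):

# Switch to the affected VS context:
vsenv 3
# Summary of accelerated vs PXL vs F2F packet ratios:
fwaccel stats -s
# Per-FWK-instance CPU and connection distribution:
fw ctl multik stat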
Hmm, any chance Anti-Bot and/or Anti-Virus were enabled? While this issue was fixed long ago, it sounds suspiciously similar to this:
Also on the original nfsd-tcp service you were using, had any settings been overridden on the Advanced screen? I've seen that fully or partially kill acceleration whereas setting it back to "Use default settings" fixes it. You may have been able to achieve the same effect by creating a new TCP/2049 service with the default settings on the Advanced screen, or perhaps the original built-in nfsd-tcp service has some kind of special inspection hooks in it (even though the protocol type is "None") and you avoided those with the new TCP/2049 service.
We only use FW and IA blades on this VS. Nothing fancy. NFSD-TCP seems original and untouched as far as I can see 🙂