Kaspars_Zibarts
Authority

Problems with large VSX platforms running R80.40 take 94

This is just a heads-up if you are running VSX with more than 4 SND / MQ cores and fairly high traffic volumes.

Please do not upgrade to R80.40 with a jumbo take lower than 100 (the fix is already included from take 100); otherwise, request a portfix.

We upgraded from R80.30 take 219 to R80.40 take 94 and faced multiple issues:

  • high CPU usage on all SND / MQ cores, spiking to 100% every half a minute
  • packet loss on traffic passing through FW
  • RX packet drops on most interfaces
  • the virtual router fwk was constantly running at 100%
  • clustering was reporting Sync interface down in fwk.elg

I don't want to go into the root cause details, but let's say there is a miscalculation of the resources required to handle information between SND and FWK workers. That's corrected in JHF take 100 or above.

I just wanted to say that Check Point R&D was outstanding today, jumping on the case and working until it was resolved! Really impressed. I wish I could share names here, but that would be inappropriate.

 

20 Replies
MatanYanay
Employee

Hi @Kaspars_Zibarts 

Thank you for the information and heads up  

I just wanted to update you: PRJ-15447 has already been released and is part of our latest ongoing jumbo (take 100).

Matan.

Kaspars_Zibarts
Authority

@MatanYanay not such great news this morning: the virtual router CPU overload is back, including packet loss. You can see it has been growing gradually since the fix. I have failed over to the standby VSX and it seems to be holding for now. Need your top guns back! 🙂

image.png

Henrik_Noerr1
Collaborator

Keep us updated @Kaspars_Zibarts 🙂 We have 4 large vsx clusters on open servers r80.40 take 91.

I can't say we experience massive drops, but I am always suspicious. The load is unexpectedly high, and R80.40 has in general been an uphill struggle.

/Henrik

genisis__
Advisor

Are you seeing high CPU utilisation relative to low throughput? If so, this matches what I see.

Henrik_Noerr1
Collaborator

@MatanYanay - PRJ-15447 is not mentioned in the JHF release notes. Can you please elaborate?

/Henrik

Kaspars_Zibarts
Authority

R&D confirmed that it is included @Henrik_Noerr1 

quick check: 

fw ctl get int fwmultik_gconn_segments_num

It should return the number of SND cores. When unpatched, it will return 4.
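
A minimal sketch of how this quick check could be scripted, assuming the command prints a single line in the form "fwmultik_gconn_segments_num = N" (the format shown in the output quoted further down this thread). The helper names parse_segments and check_segments are hypothetical, and the expected SND core count is passed in by hand because nothing here says how to read it from the system.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any Check Point tooling.

# Extract the integer from a line such as "fwmultik_gconn_segments_num = 4".
# Prints nothing if the line does not match that format.
parse_segments() {
    printf '%s\n' "$1" |
        sed -n 's/^fwmultik_gconn_segments_num *= *\([0-9][0-9]*\) *$/\1/p'
}

# Compare the reported value ($1 = raw command output) with the SND core
# count supplied by the caller ($2). Prints "ok" or "mismatch".
check_segments() {
    reported=$(parse_segments "$1")
    expected=$2
    if [ -n "$reported" ] && [ "$reported" -eq "$expected" ]; then
        echo "ok"
    else
        echo "mismatch"
    fi
}

# Usage on a gateway, with 8 as an assumed SND core count:
#   check_segments "$(fw ctl get int fwmultik_gconn_segments_num)" 8
```

On a system with exactly 4 SND cores the result is inconclusive, since an unpatched system also reports 4.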

In all honesty, we have two other VSX clusters running VSLS at approx. 10 Gbps and we do not see issues there.

The difference I see is that this particular VSX is HA instead of VSLS, and we have a VR (virtual router) interconnecting nearly all VSes and carrying a fairly high volume of traffic.

Henrik_Noerr1
Collaborator

Yeah - we get 4 on all nodes.

UserCenter unsurprisingly yields no results for fwmultik_gconn_segments_num.

Any info? I'm feeling too lazy to create an SR just before my Easter vacation 🙂

 

/Henrik

Kaspars_Zibarts
Authority

Indeed, this is one of those settings buried deep in the system configuration. I just know the required values. If any of the R&D guys want to expand on it, I'll let them do it. Don't want to spread incorrect or forbidden info! 🙂

genisis__
Advisor

I just ran this on my system and I get the following:

# fw ctl get int fwmultik_gconn_segments_num
fwmultik_gconn_segments_num = 4

genisis__
Advisor

Implemented JHFA102 and the above parameter now reports correctly. Strangely, I had to reboot the node twice.

Also, the SNMPv3 user seemed to get corrupted; I had to delete it and add it back in, otherwise SNMP details would not be discovered.

In /var/log/messages I have been seeing lots of the following since implementing the Jumbo:

kernel: dst_release: dst:xxxxxxxxxxxx refcnt:-xx

The above is explained in sk166363, which implies it can be safely ignored. That said, I'm still going to ask TAC about it to cover myself.
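
For anyone who wants a number to hand to TAC, here is a minimal sketch for counting these messages. It assumes the lines contain the literal text "kernel: dst_release:" as shown above; the function name count_dst_release is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count kernel "dst_release" lines in a log file.
# The log path defaults to /var/log/messages and can be overridden via $1,
# e.g. to check a rotated log.
count_dst_release() {
    logfile=${1:-/var/log/messages}
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'kernel: dst_release:' "$logfile"
}

# Usage:
#   count_dst_release
#   count_dst_release /var/log/messages.1
```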

So far I also believe CPU usage pattern has changed for the better but will tell tomorrow during a working day.

Kaspars_Zibarts
Authority

We rolled ours back after a 2-week struggle with CPU issues. Some of them later turned out to be unrelated to the upgrade, so another attempt is being planned. I still suspect our VSX setup: HA with a virtual router, which is different from the other VSX clusters we have (VSLS, no VR). Plus the load: 30 Gbps vs 10 Gbps.

genisis__
Advisor

It's interesting, because I suspect a VSW issue in my case and you suspect a VR issue in yours. Both would use wrp links and then have a single physical interface leading to the external network.

Kaspars_Zibarts
Authority

My other two VSX clusters both have a VSW and they work just fine :). I suspect the combination of HA cluster type + VR 🙂

genisis__
Advisor

more importantly update the SK to reflect everything fixed in the Jumbo.

genisis__
Advisor

I wonder if this is also playing into my long standing issue with performance!  I know we have discussed this in another thread.  

I also found that cppcap is broken in R80.40 with JHFA91 and fixed in JHFA100 (or you can get an updated rpm). I also suspect tcpdump is not quite working.

When running either of these, I only seem to see one-way traffic. For example, if I ping from a workstation, the ping works, but tcpdump and cppcap show only the echo-request traffic; the echo-reply is never seen. (For tcpdump, SecureXL was turned off; cppcap does not need SecureXL turned off.)

I think Check Point needs to slow down with their releases and really focus on reducing the bugs. This will help all of us, including TAC!

Timothy_Hall
Champion

The inability to see two-way traffic with tcpdump/cppcap in VSX may be due to a little-known SecureXL feature known as "warp jump". This was mentioned in my Max Capture video series and is also covered here:

sk167462: Tcpdump / CPpcap do not show incoming packets on Virtual Switch's Wrp interface

fw monitor -F should be able to successfully capture this traffic.
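
As a rough sketch of what such a capture filter could look like for the ping test described above: to my recollection, each -F filter takes five comma-separated fields (source IP, source port, destination IP, destination port, IP protocol number) with 0 acting as a wildcard, so please verify that against the fw monitor documentation for your version. The helper name build_fw_monitor_cmd and the addresses are hypothetical; the function only prints the command line and does not run it.

```shell
#!/bin/sh
# Hypothetical helper: print an "fw monitor -F" command line for one flow.
# Assumed field order: src IP, src port, dst IP, dst port, protocol number.
# Ports are set to 0 (wildcard) here.
build_fw_monitor_cmd() {
    src=$1
    dst=$2
    proto=$3
    printf 'fw monitor -F "%s,0,%s,0,%s"\n' "$src" "$dst" "$proto"
}

# ICMP is IP protocol 1. Print one command per direction for two example hosts:
#   build_fw_monitor_cmd 192.0.2.10 198.51.100.20 1
#   build_fw_monitor_cmd 198.51.100.20 192.0.2.10 1
```

Printing the command rather than executing it leaves room to review it before running anything on a production gateway.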

New 2021 IPS/AV/ABOT Self-Guided Video Series
now available at http://www.maxpowerfirewalls.com
genisis__
Advisor

Thanks Tim.

I do have a TAC case raised for this and have provided examples for both tcpdump and cppcap, but it does sound like the SK you have mentioned.

Kaspars_Zibarts
Authority

whilst you're here @Timothy_Hall 🙂

Had a very bizarre case of NFSD-TCP (2049) connections not being accelerated: the traffic went partially via F2F and partially via PXL. After adding a "plain" port service instead of the CP pre-defined one, acceleration kicked in. Before that, 3 out of 20 FWK workers were running flat out as 7 servers generated 5 Gbps to a single destination server. Never seen this before with port 2049, and R&D are equally puzzled. We actually rolled back to R80.30 as we suspected the upgrade was the root cause, but it turned out that some application changes were made the same night as our upgrade, which started a totally new flow to port 2049.

Timothy_Hall
Champion

Hmm, any chance anti-bot and/or anti-virus were enabled? While this issue was fixed long ago, it sounds suspiciously similar to this:

sk106062: CPU load and traffic latency after activating Anti-Bot and/or Anti-Virus blade on Security...

Also on the original nfsd-tcp service you were using, had any settings been overridden on the Advanced screen?  I've seen that fully or partially kill acceleration whereas setting it back to "Use default settings" fixes it.  You may have been able to achieve the same effect by creating a new TCP/2049 service with the default settings on the Advanced screen, or perhaps the original built-in nfsd-tcp service has some kind of special inspection hooks in it (even though the protocol type is "None") and you avoided those with the new TCP/2049 service.

Kaspars_Zibarts
Authority

We only use FW and IA blades on this VS. Nothing fancy. NFSD-TCP seems original and untouched as far as I can see 🙂

image.pngimage.png
